Cadet Embed Base V1
manveertamber/cadet-embed-base-v1
published May 2025 · updated May 2025
Cadet Embed Base V1 is a BERT-base embedding model for dense retrieval, fine-tuned from e5-base-unsupervised using cross-encoder listwise distillation and synthetic queries.
specs
| Task | Dense Retrieval / Embedding |
| Architecture | BERT-base |
| Parameters | 0.1B |
| License | Apache-2.0 |
about this model
cadet-embed-base-v1 is an embedding model for dense passage retrieval, built on a BERT-base architecture (0.1 billion parameters) and fine-tuned from intfloat/e5-base-unsupervised. It achieves state-of-the-art retrieval effectiveness among BERT-scale embedding models, as reported in the accompanying research paper (arXiv:2505.19274). The model is released under the Apache-2.0 license.
Training Approach
The model is trained using cross-encoder listwise distillation with two teacher rerankers: RankT5-3B and BAAI/bge-reranker-v2.5-gemma2-lightweight. Training data consists of over 400,000 passages drawn from MS MARCO, DBpedia, and Wikipedia corpora, paired with purely synthetic queries generated by Llama-3.1 8B.
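As a rough illustration of what listwise distillation optimizes, the sketch below computes a KL divergence between a teacher reranker's score distribution and the student embedding model's similarity distribution over one query's candidate passages. It is a minimal sketch under stated assumptions, not the authors' training code: the function names, the temperatures, and the example scores are hypothetical, and how the two teachers' scores are combined is not specified here, so a single teacher score list is used.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn a list of scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_kl_loss(teacher_scores, student_scores,
                     teacher_temp=1.0, student_temp=1.0):
    """KL(teacher || student) over one query's list of candidate passages."""
    p = softmax(teacher_scores, teacher_temp)
    q = softmax(student_scores, student_temp)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical scores for one query and four candidate passages.
teacher = [4.0, 1.5, 0.2, -1.0]      # cross-encoder reranker scores (assumed values)
student = [0.82, 0.55, 0.60, 0.31]   # embedding similarities (assumed values)

# The student temperature of 0.05 is an illustrative choice, not a documented setting.
loss = listwise_kl_loss(teacher, student, student_temp=0.05)
print(f"listwise KL loss: {loss:.4f}")
```

The loss is zero only when the student ranks and weights the whole list the way the teacher does, which is what distinguishes a listwise target from a single positive label.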
Synthetic Query Diversity
Queries are generated across multiple types: questions, claims, titles, keywords, and zero-shot or few-shot web queries. The paper reports that this diversity yields greater retrieval effectiveness than any single query type alone, and that synthetic queries are about as useful as human-written queries for training the model.
Effectiveness
Conventional contrastive learning with the InfoNCE loss can degrade effectiveness when applied to state-of-the-art embedding models. Listwise distillation, by contrast, consistently improves retrieval effectiveness across multiple datasets. According to the paper, this approach lets cadet-embed-base-v1 reach the highest reported effectiveness among BERT-base embedding models on standard benchmarks.
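For contrast, the InfoNCE objective mentioned above can be sketched as a softmax cross-entropy with a single positive passage: every other candidate is treated as equally wrong, whereas a listwise target carries graded teacher scores. This is a generic textbook formulation, not code from the paper; the function name, the default temperature, and the similarity values are illustrative assumptions.

```python
import math


def info_nce_loss(similarities, positive_index, temperature=0.05):
    """Negative log-probability of the positive under a softmax over similarities."""
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)  # log-sum-exp trick for numerical stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return log_norm - scaled[positive_index]


# Hypothetical query-passage similarities: index 0 is the labeled positive,
# the remaining entries are negatives (for example, other passages in the batch).
example_similarities = [0.82, 0.55, 0.60, 0.31]
example_loss = info_nce_loss(example_similarities, positive_index=0)
print(f"InfoNCE loss: {example_loss:.4f}")
```

Because the target is one-hot, a near-duplicate of the positive that happens to sit among the negatives is penalized as heavily as an unrelated passage.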
best for
- Dense passage retrieval for question answering
- Retrieval-augmented generation (RAG) backends
- Semantic similarity search for domain-specific corpora
FAQ
The model is best suited to dense retrieval tasks that need a lightweight, effective BERT-base embedding model for semantic search and ranking.
According to the paper, it achieves state-of-the-art retrieval effectiveness among BERT-base embedding models, which the authors attribute to listwise distillation and diverse synthetic queries.
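Queries should be prefixed with "query:" and passages with "passage:". The model outputs normalized embeddings, so passages can be scored against a query with a dot product, which for unit-length vectors equals cosine similarity.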
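The prefixing and dot-product scoring described above might look like the following sketch. Several things here are assumptions: the helper names are hypothetical, the space after each colon follows the E5 convention of the base model, and `toy_encode` is a hashed bag-of-words stand-in used only so the example runs without downloading anything. It is not the real model and its scores carry no meaning beyond word overlap.

```python
import math
import zlib
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def as_query(text: str) -> str:
    """Add the query prefix (trailing space assumed, as in the E5 convention)."""
    return "query: " + text


def as_passage(text: str) -> str:
    """Add the passage prefix (trailing space assumed, as in the E5 convention)."""
    return "passage: " + text


def l2_normalize(vec: Sequence[float]) -> Vector:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def rank_passages(query: str, passages: Sequence[str],
                  encode: Callable[[str], Sequence[float]]) -> List[Tuple[float, str]]:
    """Score each passage against the query by dot product, best first."""
    q = l2_normalize(encode(as_query(query)))
    scored = [(dot(q, l2_normalize(encode(as_passage(p)))), p) for p in passages]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_encode(text: str, dim: int = 4096) -> Vector:
    """Stand-in encoder (hashed bag of words). NOT the real embedding model."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


ranked = rank_passages(
    "what is dense retrieval",
    ["dense retrieval encodes text as vectors", "bananas are yellow"],
    toy_encode,
)
for score, passage in ranked:
    print(f"{score:.3f}  {passage}")
```

In practice `toy_encode` would be replaced by a call into whatever runtime serves the model, for example a sentence-embedding library, assuming the checkpoint loads there. The explicit normalization is then redundant but harmless.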
The model is licensed under Apache-2.0.
Use the gigarouter OpenAI-compatible endpoint with your API key, following the standard embedding API format.
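A request in the standard OpenAI-style embedding format could be assembled roughly as follows. This sketch only builds the request and parses a response; it does not send anything. The base URL is a placeholder, the model identifier is assumed to match the repository name, and both should be replaced with the values the provider documents.

```python
import json
import urllib.request

BASE_URL = "https://api.example.invalid/v1"      # placeholder, not a real endpoint
MODEL_ID = "manveertamber/cadet-embed-base-v1"   # assumed hosted model identifier


def build_embedding_request(texts, api_key, base_url=BASE_URL, model=MODEL_ID):
    """Build a POST request in the OpenAI-style embeddings format."""
    body = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=body,
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_embedding_response(raw):
    """Return the embeddings from an OpenAI-style response, in input order."""
    payload = json.loads(raw)
    ordered = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in ordered]


request = build_embedding_request(
    ["query: what is dense retrieval?"],
    api_key="YOUR_API_KEY",
)
# To send it once a real endpoint is configured:
#   with urllib.request.urlopen(request) as response:
#       vectors = parse_embedding_response(response.read())
```

The input string carries the "query:" prefix explicitly, on the assumption that the endpoint does not add prefixes itself.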
We're benchmarking and onboarding Cadet Embed Base V1 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.