Cadet Embed Base V1

manveertamber/cadet-embed-base-v1

published May 2025 · updated May 2025

Cadet Embed Base V1 is a BERT-base embedding model for dense retrieval, fine-tuned from e5-base-unsupervised using cross-encoder listwise distillation and synthetic queries.

est. price

~$0.008 / 1M tokens · estimated, set at launch

license

apache-2.0

specs

Task	Dense Retrieval / Embedding
Architecture	BERT-base
Parameters	0.1B
License	Apache-2.0

about this model

cadet-embed-base-v1 is an embedding model for dense passage retrieval, built on a BERT-base architecture (0.1 billion parameters) and fine-tuned from intfloat/e5-base-unsupervised. It achieves state-of-the-art retrieval effectiveness among BERT-scale embedding models, as reported in the accompanying research paper (arXiv:2505.19274). The model is released under the Apache-2.0 license.

Training Approach

The model is trained using cross-encoder listwise distillation with two teacher rerankers: RankT5-3B and BAAI/bge-reranker-v2.5-gemma2-lightweight. Training data consists of over 400,000 passages drawn from MS MARCO, DBpedia, and Wikipedia corpora, paired with purely synthetic queries generated by Llama-3.1 8B.
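As a rough illustration of what listwise distillation optimizes, the sketch below computes a KL divergence between the teacher's and the student's score distributions over one query's candidate passages. This is a common formulation and an assumption here, not the paper's confirmed loss: the temperatures, the function names, and the use of a single teacher score list (rather than two combined rerankers) are all illustrative.

```python
import math


def log_softmax(scores, temperature=1.0):
    """Numerically stable log-softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def listwise_distillation_loss(teacher_scores, student_scores,
                               teacher_temp=1.0, student_temp=1.0):
    """KL(teacher || student) over one query's candidate list.

    teacher_scores: cross-encoder relevance scores for each candidate passage.
    student_scores: embedding-model similarities (e.g. dot products) for the
                    same passages, in the same order.
    Temperatures are hypothetical hyperparameters, not values from the paper.
    """
    if len(teacher_scores) != len(student_scores):
        raise ValueError("teacher and student must score the same candidates")
    log_p = log_softmax(teacher_scores, teacher_temp)  # target distribution
    log_q = log_softmax(student_scores, student_temp)  # student distribution
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy example: the student that preserves the teacher's ordering gets a lower loss.
teacher = [2.0, 0.5, -1.0]
print(listwise_distillation_loss(teacher, [1.5, 0.4, -0.8]))
print(listwise_distillation_loss(teacher, [-1.0, 0.5, 2.0]))
```

The loss is zero when the student's distribution matches the teacher's exactly and grows as the student's ranking diverges, so the student learns graded relevance over the whole list rather than a single positive label.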

Synthetic Query Diversity

Queries are generated across multiple types: questions, claims, titles, keywords, and zero-shot or few-shot web queries. The paper reports that combining these types yields greater retrieval effectiveness than training on any single query type alone, and that synthetic queries offer utility comparable to human-written queries for training.

Effectiveness

The paper finds that fine-tuning with the conventional InfoNCE contrastive loss often reduces effectiveness in already state-of-the-art embedding models, whereas listwise distillation improves retrieval effectiveness more consistently across multiple datasets. With this approach, cadet-embed-base-v1 reaches the highest reported effectiveness among BERT-base embedding models on standard benchmarks, per the paper.
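For reference, the InfoNCE loss mentioned here can be sketched as a cross-entropy over one positive passage and a set of negatives. The function name and the temperature value below are illustrative assumptions, not settings taken from the paper.

```python
import math


def info_nce_loss(positive_score, negative_scores, temperature=0.05):
    """InfoNCE for a single query.

    loss = -log( exp(s_pos / t) / (exp(s_pos / t) + sum_j exp(s_neg_j / t)) )

    positive_score: similarity between the query and its labeled positive.
    negative_scores: similarities between the query and negative passages.
    temperature: hypothetical value chosen for illustration only.
    """
    logits = [positive_score / temperature]
    logits += [s / temperature for s in negative_scores]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]


# Toy example: a higher positive similarity gives a lower loss.
print(info_nce_loss(0.9, [0.1, 0.2]))
print(info_nce_loss(0.3, [0.1, 0.2]))
```

InfoNCE uses a hard label (one positive, everything else negative), so a mislabeled or merely unlabeled relevant passage is pushed away; a distillation target built from reranker scores carries graded relevance instead.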

best for

·Dense passage retrieval for question answering
·Retrieval-augmented generation (RAG) backends
·Semantic similarity search for domain-specific corpora

FAQ

What is this model best for?

Dense retrieval tasks where you need a lightweight, effective BERT-base embedding model for semantic search and ranking.

How does it compare to other BERT embedding models?

It achieves state-of-the-art effectiveness among BERT embedding models for retrieval, per the paper, due to listwise distillation and diverse synthetic queries.

What input format does the model expect?

Queries should be prefixed with "query:" and passages with "passage:"; outputs are normalized embeddings for dot-product scoring.
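A minimal sketch of that input convention and of dot-product scoring follows. The helper names are hypothetical, the trailing space after each prefix is an assumption borrowed from the E5 convention, and `embed` assumes the checkpoint loads through the sentence-transformers library (it is not called in the toy example, which uses hand-written vectors).

```python
import math

QUERY_PREFIX = "query: "      # assumption: trailing space, as in E5-style models
PASSAGE_PREFIX = "passage: "  # assumption: trailing space, as in E5-style models


def format_query(text):
    """Prepend the query prefix the model expects."""
    return QUERY_PREFIX + text.strip()


def format_passage(text):
    """Prepend the passage prefix the model expects."""
    return PASSAGE_PREFIX + text.strip()


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rank_passages(query_vec, passage_vecs):
    """Return passage indices sorted by descending dot product with the query."""
    scores = [dot(query_vec, p) for p in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def embed(texts):
    """Encode already-prefixed texts; assumes sentence-transformers is installed
    and that this checkpoint is loadable with it (downloads weights on first use)."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("manveertamber/cadet-embed-base-v1")
    return model.encode(texts, normalize_embeddings=True)


# Toy example with hand-written vectors standing in for real embeddings.
print(format_query("capital of France"))
q = l2_normalize([1.0, 0.0])
passages = [l2_normalize(v) for v in ([0.0, 1.0], [1.0, 1.0], [1.0, 0.0])]
print(rank_passages(q, passages))
```

Because the vectors are unit length, the dot product equals cosine similarity, so either metric gives the same ranking.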

What is the license?

Apache-2.0.

How can I call this model via API?

Hosted access is not live yet. Once it is, the model is expected to be callable through the gigarouter OpenAI-compatible endpoint with your API key, using the standard embeddings request format.
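Since the hosted endpoint is not yet live, the following is only a sketch of what a request in the standard OpenAI embeddings format could look like. The base URL is a hypothetical placeholder, the model identifier is assumed to match the listing id, and the helper names are illustrative; the code builds the request and parses a response body but does not send anything.

```python
import json
import urllib.request

BASE_URL = "https://api.example.invalid/v1"      # hypothetical placeholder, not a real endpoint
MODEL_ID = "manveertamber/cadet-embed-base-v1"   # assumption: same id as the listing


def build_embedding_request(texts, api_key, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a POST request in the OpenAI embeddings format."""
    payload = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_embedding_response(body):
    """Extract vectors from an OpenAI-style response body, in input order."""
    data = json.loads(body)["data"]
    return [item["embedding"] for item in sorted(data, key=lambda d: d["index"])]


# Usage sketch: prefix inputs as the model expects, then build the request.
req = build_embedding_request(
    ["query: capital of France", "passage: Paris is the capital of France."],
    api_key="YOUR_API_KEY",
)
print(req.get_method(), req.full_url)
# To send it once a real endpoint exists: urllib.request.urlopen(req).read()
```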

not yet live

We're benchmarking and onboarding Cadet Embed Base V1 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
