ColBERT Zero supervised

lightonai/ColBERT-Zero-supervised

published Feb 2026 · updated Feb 2026

An open embeddings model with 84 downloads a month. gigarouter is benchmarking it and plans to host it as an OpenAI-compatible API.

est. price: ~$0.008 / 1M tokens (estimated, set at launch)
API providers: 0
downloads / mo: 84
license: apache-2.0

about this model

ColBERT-Zero-supervised is a multi-vector embedding model (ColBERT‑style) that has undergone supervised contrastive fine‑tuning as the second phase of the ColBERT‑Zero training pipeline. It is built on the ModernBERT‑base architecture and trained exclusively on public data (the Nomic‑embed mixture).

The ColBERT‑Zero family demonstrates that performing contrastive pre‑training directly in the multi‑vector setting unlocks significantly higher performance than the standard approach of bolting a knowledge‑distillation step onto a dense model. The full three‑phase ColBERT‑Zero model achieves a new state‑of‑the‑art for models under 150M parameters: 55.43 nDCG@10 on the BEIR benchmark, outperforming GTE‑ModernColBERT and its base model GTE‑ModernBERT despite the latter using proprietary data.
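To make the multi‑vector idea concrete, here is a minimal sketch of ColBERT‑style late‑interaction scoring with the MaxSim operator: each query token vector is matched to its most similar document token vector, and those per‑token maxima are summed. The toy 2‑dimensional vectors and the names `l2_normalize` and `maxsim_score` are illustrative assumptions, not the model's released code.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best cosine match among document tokens."""
    q = [l2_normalize(v) for v in query_vecs]
    d = [l2_normalize(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )

# Hypothetical token embeddings (one vector per token), for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0]]                          # covers only the first query token

print(maxsim_score(query, doc_a))  # 2.0
print(maxsim_score(query, doc_b))  # 1.0
```

A dense model would compress each text into a single vector before comparing; here every token keeps its own vector, which is why `doc_a` is rewarded for matching both query tokens.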

Controlled Comparison

The training was deliberately designed to isolate the impact of multi‑vector pre‑training. Because it uses the same public data, the same ModernBERT base, and the same pipeline as the dense ModernBERT‑embed model, the only variable is the contrastive objective (dense versus multi‑vector). The dense baseline scores 52.89 nDCG@10 on BEIR; the full ColBERT‑Zero model reaches 55.43, a gain of 2.54 points, closing a 2.4‑point data quality gap.

Three‑Phase Pipeline

ColBERT‑Zero‑supervised represents the output of Phase 2 (supervised contrastive fine‑tuning with mined hard negatives). The full pipeline includes:

  • Phase 1 – Unsupervised contrastive pre‑training with effective batch sizes of ~16k via GradCache and cross‑GPU gathering.
  • Phase 2 – Supervised fine‑tuning on Nomic’s supervised data with mined hard negatives.
  • Phase 3 – Knowledge distillation from a Gemma‑based teacher using the MaxSim operator.
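The contrastive phases can be illustrated with a generic in‑batch loss. The sketch below uses the common InfoNCE form, which is an assumption: the page does not state the exact loss, temperature, or batching code used for ColBERT‑Zero. Each row holds one query's similarity scores against every document in the batch (for a multi‑vector model these would be MaxSim scores), with the positive on the diagonal. Mined hard negatives, as in Phase 2, simply add extra columns.

```python
import math

def info_nce_loss(scores, temperature=1.0):
    """Mean cross-entropy of picking the positive (column i) for each query i.

    scores[i][j] is the similarity of query i to document j.
    The temperature default is a placeholder, not a value from the source.
    """
    total = 0.0
    for i, row in enumerate(scores):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(scores)

# Hypothetical 2-query batch: in-batch negatives only.
in_batch = [[5.0, 1.0],
            [1.0, 5.0]]

# Same batch with one mined hard negative per query appended as a third column.
with_hard_negatives = [[5.0, 1.0, 4.0],
                       [1.0, 5.0, 4.0]]

print(info_nce_loss(in_batch))
print(info_nce_loss(with_hard_negatives))
```

The hard negatives score close to the positives, so they raise the loss and give the model a stronger training signal than easy in‑batch negatives alone. Larger batches (such as the ~16k effective size in Phase 1) widen each row in the same way.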

Key Findings

Supervised contrastive fine‑tuning followed by distillation (the combination that includes this checkpoint) achieves 55.12 nDCG@10 – 99.4% of the full model’s performance at roughly 10% of the compute cost (~40 vs. ~408 GH200‑hours). Prompt alignment is critical: mismatched prompts can quietly degrade performance by over 0.8 points.
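The ratios quoted above can be checked directly from the stated figures. This short sketch only reproduces the arithmetic; the variable names are illustrative, and the GH200‑hour values are the approximate ones given in the text.

```python
# Figures as stated on this page.
full_pipeline_ndcg = 55.43      # full three-phase model, nDCG@10 on BEIR
supervised_distill_ndcg = 55.12  # supervised fine-tuning + distillation
supervised_distill_hours = 40    # approximate GH200-hours
full_pipeline_hours = 408        # approximate GH200-hours

relative_quality = supervised_distill_ndcg / full_pipeline_ndcg
compute_ratio = supervised_distill_hours / full_pipeline_hours

print(f"quality retained: {relative_quality:.1%}")  # 99.4%
print(f"compute used:     {compute_ratio:.1%}")     # 9.8%
```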

This supervised checkpoint is released for researchers studying the incremental impact of each training phase and prompt alignment.

not yet live

We're benchmarking and onboarding ColBERT Zero supervised as a hosted, OpenAI-compatible API.
