Udever BLOOM 1.1B
izhx/udever-bloom-1b1
published Oct 2023 · updated Nov 2023
Udever BLOOM 1.1B is an embedding model that generates universal embeddings for multiple natural and programming languages. It is fine-tuned from BLOOM-1B1 via BitFit on MS MARCO, SNLI, and MultiNLI.
specs
| Spec | Value |
|---|---|
| Task | Text Embedding |
| Architecture | Decoder-only Transformer (BLOOM) |
| Parameters | 1.1 billion |
| License | bigscience-bloom-rail-1.0 |
| Training Data | MS MARCO Passage Ranking, SNLI, MultiNLI |
about this model
Udever-bloom-1b1 is a universal embedding model fine-tuned from bigscience/bloom-1b1 via BitFit on MS MARCO Passage Ranking, SNLI, and MultiNLI data. It is designed to generate high-quality embeddings across tasks, natural languages, and programming languages. It is part of the Udever family, which extends the approach of "Language Models are Universal Embedders" (presented at the XLLM Workshop, ACL 2025). The model is a decoder-only Transformer trained with a contrastive loss and hard negatives, and the training code is publicly available.
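To make "contrastive loss with hard negatives" concrete, here is a minimal sketch of an InfoNCE-style loss for a single query. It is an illustration under assumptions, not the authors' training code: the cosine scoring, the `temperature` value, and the function names are all hypothetical choices made for this example.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def contrastive_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss for one query (temperature is an assumed value).

    The positive document competes against the hard negatives in a softmax;
    the loss is the negative log-probability assigned to the positive.
    """
    q = normalize(query)
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [dot(q, c) / temperature for c in candidates]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# Toy 2-d vectors: the loss is small when the positive is closest to the query
# and large when a hard negative is closer than the positive.
easy = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
hard = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(easy < hard)
```

Hard negatives matter because they are the candidates that score close to the positive, so they contribute most to the softmax denominator and therefore to the gradient.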
Benchmark Performance
On the Massive Text Embedding Benchmark (MTEB, 56 datasets), Udever-bloom-1b1 achieves an average score of 58.28, with strong results in classification (70.18), pair classification (83.11), and STS (81.52). On CodeSearchNet, it averages 80.90 across six programming languages (Go, Ruby, Python, Java, JavaScript, PHP). In Chinese multi-domain retrieval (Multi-CPR), it obtains MRR@10 scores of 0.244 (E-commerce), 0.208 (Entertainment video), and 0.241 (Medical). The model also handles languages and tasks not seen during fine-tuning, as demonstrated in the paper’s zero-shot evaluations.
| Benchmark | Metric | Udever-bloom-1b1 | Reference (OpenAI ada-002) |
|---|---|---|---|
| MTEB | Average | 58.28 | 60.99 |
| CodeSearchNet | Avg. MRR | 80.90 | – |
| Multi-CPR (E-commerce) | MRR@10 | 0.244 | 0.183 |
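For readers unfamiliar with the MRR@10 metric reported above, the following sketch shows how it is typically computed: for each query, take the reciprocal rank of the first relevant document within the top 10 results (0 if none appears), then average over queries. The function name and the toy document IDs are illustrative and not taken from any benchmark's code.

```python
def mrr_at_k(ranked_ids_per_query, relevant_ids_per_query, k=10):
    """Mean reciprocal rank, counting only the top-k results for each query."""
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_ids_per_query)


# Toy example with three queries: first hit at rank 2, rank 1, and no hit.
rankings = [["d3", "d1", "d7"], ["d2", "d9"], ["d5"]]
relevant = [{"d1"}, {"d2"}, {"d8"}]
score = mrr_at_k(rankings, relevant)
print(score)  # (1/2 + 1 + 0) / 3 = 0.5
```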
Additional checkpoints (560m, 3b, 7b1) are available on Hugging Face and ModelScope. The underlying BLOOM base model was trained on 46 natural languages and 13 programming languages. Per-dataset MTEB metrics are listed on the model’s Hugging Face page.
best for
- Multilingual semantic search and retrieval
- Code-to-code search and code retrieval
- Text classification and sentence similarity (STS)
- Cross-lingual embedding tasks
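The search and retrieval use cases above usually come down to ranking document embeddings by their similarity to a query embedding. The sketch below assumes cosine similarity as the scoring function, which is a common choice but is not stated in this listing. The vectors are toy values standing in for real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-d "embeddings"; real embeddings would come from the model.
docs = {"a": [0.0, 1.0], "b": [2.0, 0.0], "c": [1.0, 1.0]}
ranking = rank_documents([1.0, 0.0], docs)
print([doc_id for doc_id, _ in ranking])
```

The same ranking step applies whether the documents are natural-language passages or code snippets, since both are mapped into one embedding space.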
FAQ
It is best for generating universal embeddings across tasks (retrieval, classification, reranking, STS) and languages, including natural and programming languages.
It is a decoder-only Transformer based on bigscience/bloom-1b1, fine-tuned with BitFit.
The model is derived from BLOOM-1B1 and uses the bigscience-bloom-rail-1.0 license.
Use the gigarouter OpenAI-compatible endpoint with your API key. The model expects input with special tokens [BOQ]/[EOQ] for queries and [BOD]/[EOD] for documents.
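As a rough illustration of the special-token convention, the sketch below wraps queries and documents in their respective markers and builds a request body in the shape of the OpenAI embeddings API (`model` and `input` fields). Several things here are assumptions: the model identifier is copied from this listing, the helper names are hypothetical, and it is not confirmed whether the hosted endpoint expects clients to add the tokens or adds them server-side. No endpoint URL is shown because none is given here.

```python
import json

# Special tokens named in this listing.
BOQ, EOQ, BOD, EOD = "[BOQ]", "[EOQ]", "[BOD]", "[EOD]"


def wrap(text, is_query):
    """Surround text with query or document markers (hypothetical helper)."""
    if is_query:
        return f"{BOQ}{text}{EOQ}"
    return f"{BOD}{text}{EOD}"


def build_embedding_request(texts, is_query, model="izhx/udever-bloom-1b1"):
    """Build an OpenAI-style embeddings request body (model ID is assumed)."""
    return {"model": model, "input": [wrap(t, is_query) for t in texts]}


body = build_embedding_request(["how to reverse a list in python"], is_query=True)
print(json.dumps(body, indent=2))
```

Queries and documents should be wrapped with their own marker pair before embedding, so it is worth keeping the two code paths explicit, as `is_query` does here.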
It supports the same languages as BLOOM-1B1: 46 natural languages, including English and Chinese, plus 13 programming languages. See the BLOOM training data for the full list.
We're benchmarking and onboarding Udever BLOOM 1.1B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.