Bilingual Embedding Base
Lajavaness/bilingual-embedding-base
published Jun 2024 · updated Nov 2024
Bilingual Embedding Base is a sentence embedding model that encodes French and English sentences into a 1024-dimensional vector space for semantic search and text clustering.
specs
| Spec | Value |
|---|---|
| Task | Sentence Embedding |
| Architecture | XLM-RoBERTa with SentenceTransformer pooling |
| Languages | French, English |
| Max Sequence Length | 512 tokens |
| Embedding Dimension | 1024 |
about this model
Lajavaness/bilingual-embedding-base encodes French and English sentences into a shared 1024-dimensional vector space. It uses a custom BilingualModel architecture built on XLM-RoBERTa within the SentenceTransformers framework. The model applies mean pooling over token embeddings, then L2-normalizes the result, producing fixed-size vectors suitable for semantic similarity, clustering, and retrieval tasks.
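The pooling step described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the model's own code: the function names `mean_pool` and `l2_normalize` are made up for this example, and the toy vectors have 4 dimensions instead of 1024.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                summed[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [s / count for s in summed]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

# Toy input: three token positions, the last one is padding and must be ignored.
tokens = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],  # padding position
]
mask = [1, 1, 0]

sentence_vec = l2_normalize(mean_pool(tokens, mask))
print(sentence_vec)
```

Because the output has unit length, the dot product of two such vectors equals their cosine similarity, which is what makes the embeddings convenient for similarity search.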
Architecture and Training
The model is based on XLM-RoBERTa with a maximum sequence length of 512 tokens. Training proceeds in three stages:
- Stage 1 – NLI training: Multiple Negatives Ranking Loss on combined English (SNLI) and French (XNLI) datasets to learn sentence semantics.
- Stage 3 – STS fine-tuning: Siamese BERT networks with cosine-similarity loss on English and French STS Benchmark data.
- Stage 4 – Augmented SBERT: Data augmentation using cross-encoder labels (see Augmented SBERT), improving bi-encoder performance on pairwise scoring tasks.
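The two loss functions named in the stages above can be illustrated with a small, dependency-free sketch. It assumes the common formulations: in-batch cross-entropy over scaled cosine similarities for the ranking loss, and mean squared error against a gold score for the cosine-similarity loss. The `scale=20.0` value and all function names are assumptions for illustration, not details taken from the model card, and the Augmented SBERT stage is not covered.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """For anchor i, positives[i] is the target; every other positive in the
    batch serves as a negative. Returns the mean cross-entropy."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

def cosine_similarity_loss(pairs, gold_scores):
    """Mean squared error between predicted cosine similarity and gold score."""
    errors = [(cosine(a, b) - g) ** 2 for (a, b), g in zip(pairs, gold_scores)]
    return sum(errors) / len(errors)

# Toy batch: each anchor matches its own positive exactly.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = multiple_negatives_ranking_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
shuffled = multiple_negatives_ranking_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The ranking loss only needs (anchor, positive) pairs, which is why NLI entailment pairs are a natural fit for it, while the cosine-similarity loss needs a graded score per pair, as STS data provides.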
Benchmark Performance (MTEB French tasks)
| Task | Metric | Score |
|---|---|---|
| AlloProfClusteringP2P | v_measure | 64.71 |
| AlloProfClusteringS2S | v_measure | 45.57 |
| AlloprofReranking | MAP | 70.46 |
| AlloprofReranking | MRR | 71.61 |
| AlloprofRetrieval | MAP@10 | 35.52 |
| AlloprofRetrieval | MRR@10 | 35.52 |
These scores cover French-language clustering, reranking, and retrieval tasks. They indicate solid French performance but do not, on their own, measure English or cross-lingual quality.
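For readers unfamiliar with the retrieval metrics in the table, the sketch below shows one common way to compute MRR@10 and MAP@10. The helper names and the toy rankings are made up for illustration, and the exact definitions used by the benchmark harness may differ in detail. Under this definition the two metrics are equal whenever every query has exactly one relevant document; the source does not say whether that is the case for AlloprofRetrieval.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1 / rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """Sum of precision values at each relevant hit in the top k,
    divided by the total number of relevant documents."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

# Toy runs: (ranked result ids, set of relevant ids), one relevant doc per query.
runs = [
    (["d3", "d1", "d7"], {"d1"}),  # relevant document at rank 2
    (["d5", "d2", "d9"], {"d5"}),  # relevant document at rank 1
    (["d4", "d6", "d8"], {"d0"}),  # relevant document not retrieved
]

mrr_at_10 = sum(reciprocal_rank_at_k(r, rel) for r, rel in runs) / len(runs)
map_at_10 = sum(average_precision_at_k(r, rel) for r, rel in runs) / len(runs)
print(mrr_at_10, map_at_10)
```

The table reports scores on a 0 to 100 scale, so a value such as 0.5 from this sketch would correspond to 50.0 there.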
best for
- Cross-lingual semantic search between French and English
- Text clustering of bilingual document collections
- Semantic textual similarity for French-English sentence pairs
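As a rough illustration of the semantic search use case listed above, the sketch below ranks a small corpus against a query by dot product. It assumes all vectors are already L2-normalized, so the dot product equals cosine similarity. The 2-dimensional vectors are invented stand-ins for real embeddings, and `top_k` is a hypothetical helper name.

```python
def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vec, corpus_vecs, k=2):
    """Return (corpus index, score) pairs for the k most similar vectors."""
    scored = [(dot(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]

# Invented unit-length vectors standing in for sentence embeddings.
corpus = [
    [1.0, 0.0],  # document 0
    [0.0, 1.0],  # document 1
    [0.6, 0.8],  # document 2
]
query = [0.8, 0.6]

results = top_k(query, corpus, k=2)
for idx, score in results:
    print(idx, round(score, 2))
```

In a bilingual setting the query and the corpus can be in different languages, since both are embedded into the same vector space before scoring.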
FAQ
The model outputs 1024-dimensional vectors after SentenceTransformer pooling.
The model supports French and English text.
The model accepts up to 512 tokens per input.
Use the gigarouter OpenAI-compatible endpoint with your API key, specifying the model name Lajavaness/bilingual-embedding-base.
The model card does not specify a license.
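A request to the OpenAI-compatible endpoint mentioned above might be built as in the sketch below. It assumes the endpoint follows the usual OpenAI embeddings convention: a POST to an `/embeddings` path with `model` and `input` fields and a bearer token. The base URL and API key are hypothetical placeholders, not real values, and the request is only constructed here, not sent.

```python
import json
import urllib.request

# Hypothetical placeholders: substitute the real base URL and key for your account.
BASE_URL = "https://api.example.com/v1"
API_KEY = "YOUR_API_KEY"

def build_embedding_request(texts, model="Lajavaness/bilingual-embedding-base"):
    """Build (but do not send) an OpenAI-style embeddings request."""
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        url=f"{BASE_URL}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_embedding_request(["Bonjour le monde", "Hello world"])
print(request.get_method(), request.full_url)

# Sending would look like this (not executed in this sketch):
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
# In the OpenAI response format, vectors sit under body["data"][i]["embedding"].
```

Inputs longer than the 512-token limit would need to be shortened or split before being sent; how the endpoint handles overlong input is not stated in the source.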
We're benchmarking and onboarding Bilingual Embedding Base as a hosted, OpenAI-compatible API.