Bilingual Embedding Base

Lajavaness/bilingual-embedding-base

published Jun 2024 · updated Nov 2024

Bilingual Embedding Base is a sentence embedding model that encodes French and English sentences into a 1024-dimensional vector space for semantic search and text clustering.

est. price: ~$0.008 / 1M tokens (estimated, set at launch)
API providers: 0
downloads / mo: 1.1K
license: apache-2.0

specs

Task: Sentence Embedding
Architecture: XLM-RoBERTa with SentenceTransformer pooling
Languages: French, English
Max Sequence Length: 512 tokens
Embedding Dimension: 1024

about this model

Lajavaness/bilingual-embedding-base encodes French and English sentences into a shared 1024-dimensional vector space. It uses a custom BilingualModel architecture built on XLM-RoBERTa within the SentenceTransformers framework. The model applies mean pooling over token embeddings followed by L2 normalization, producing fixed-size vectors suitable for semantic similarity, clustering, and retrieval tasks.
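The pooling step described above can be illustrated with a minimal pure-Python sketch. It assumes padding tokens are excluded through an attention mask, which is common practice but not confirmed by this page; the function names and the toy 2-dimensional vectors are illustrative only and are not taken from the model's code.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [x / count for x in total]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def sentence_embedding(token_embeddings, attention_mask):
    """Mean pooling followed by L2 normalization."""
    return l2_normalize(mean_pool(token_embeddings, attention_mask))

# Toy example: two real tokens and one padding token (hypothetical values).
tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]
embedding = sentence_embedding(tokens, mask)
```

Because the output has unit length, the dot product of two such embeddings equals their cosine similarity, which is convenient for semantic search.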

Architecture and Training

The model is based on XLM-RoBERTa with a maximum sequence length of 512 tokens. Training proceeds in three stages:

  • Stage 1 – NLI training: Multiple Negatives Ranking Loss on combined English (SNLI) and French (XNLI) datasets to learn sentence semantics.
  • Stage 2 – STS fine-tuning: a Siamese BERT network setup with cosine-similarity loss on English and French STS Benchmark data.
  • Stage 3 – Augmented SBERT: fine-tuning on data augmented with cross-encoder labels, which improves bi-encoder performance on pairwise scoring tasks.
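To make the NLI-stage objective more concrete, here is a rough sketch of a multiple-negatives ranking loss over in-batch negatives. It is not the authors' training code: the scale factor of 20, the function names, and the toy vectors are assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the match for anchors[i]
    and every other positive in the batch acts as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch (hypothetical embeddings): correctly aligned pairs vs. swapped pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = multiple_negatives_ranking_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = multiple_negatives_ranking_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each anchor is most similar to its own positive and large when another item in the batch scores higher, which is what pushes matching sentence pairs together during training.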

Benchmark Performance (MTEB French tasks)

Task                    Metric      Score
AlloProfClusteringP2P   v_measure   64.71
AlloProfClusteringS2S   v_measure   45.57
AlloprofReranking       MAP         70.46
AlloprofReranking       MRR         71.61
AlloprofRetrieval       MAP@10      35.52
AlloprofRetrieval       MRR@10      35.52

These scores cover French-language clustering, reranking, and retrieval tasks. They do not measure English or cross-lingual performance.
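For readers unfamiliar with the retrieval metrics in the table, the following sketch shows one common way to compute MRR@10. The document IDs and queries are made up, and the result is on a 0 to 1 scale, whereas the table above appears to report percentages.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """Return 1/rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mrr_at_k(runs, k=10):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank_at_k(r, rel, k) for r, rel in runs) / len(runs)

# Hypothetical results for three queries.
runs = [
    (["d3", "d1", "d7"], {"d1"}),  # first relevant hit at rank 2
    (["d2", "d9"], {"d2"}),        # first relevant hit at rank 1
    (["d4", "d5"], {"d8"}),        # no relevant hit in the top 10
]
score = mrr_at_k(runs)
```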

FAQ

What is the embedding dimension of this model?

The model outputs 1024-dimensional vectors after SentenceTransformer pooling.

What languages does this model support?

The model supports French and English text.

What is the maximum input length?

The model accepts up to 512 tokens per input.

How do I call this model via the gigarouter API?

The model is not yet live on gigarouter. Once it is, call the OpenAI-compatible endpoint with your API key and set the model name to Lajavaness/bilingual-embedding-base.
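As a rough sketch of what such a call could look like, the snippet below builds (but does not send) a request in the OpenAI embeddings format: a POST to an /embeddings path with a JSON body containing model and input. The base URL is a hypothetical placeholder, since this page does not state the real endpoint, and the helper name is illustrative.

```python
import json
import urllib.request

# Hypothetical placeholder: the real gigarouter base URL is not given on this page.
BASE_URL = "https://gigarouter.example/v1"
MODEL = "Lajavaness/bilingual-embedding-base"

def build_embeddings_request(texts, api_key, base_url=BASE_URL):
    """Build an OpenAI-style embeddings request without sending it."""
    payload = {"model": MODEL, "input": texts}
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_embeddings_request(
    ["Bonjour le monde", "Hello world"], api_key="YOUR_API_KEY"
)
```

Once the model is hosted and the real base URL is known, the request could be sent with urllib.request.urlopen(request) or any OpenAI-compatible client.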

What license is this model released under?

The model is released under the Apache 2.0 license (apache-2.0).

not yet live

We're benchmarking and onboarding Bilingual Embedding Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.