Bilingual Embedding Base

Lajavaness/bilingual-embedding-base

published Jun 2024 · updated Nov 2024

Bilingual Embedding Base is a sentence embedding model that encodes French and English sentences into a 1024-dimensional vector space for semantic search and text clustering.

est. price

~$0.008

/ 1M tokens · estimated, set at launch

API providers

downloads / mo

1.1K

license

apache-2.0

specs

Task	Sentence Embedding
Architecture	XLM-RoBERTa with SentenceTransformer pooling
Languages	French, English
Max Sequence Length	512 tokens
Embedding Dimension	1024

about this model

Lajavaness/bilingual-embedding-base is an embedding model that encodes French and English sentences into a shared 1024-dimensional vector space. It uses a custom BilingualModel architecture built on XLM-RoBERTa, combined with SentenceTransformers pooling. The model applies mean pooling over token embeddings followed by L2 normalization, producing fixed-size vectors suitable for semantic similarity, clustering, and retrieval tasks.
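
The pooling step described above can be illustrated with a minimal pure-Python sketch. It is not the model's actual implementation: the function names and the toy 2-dimensional token vectors are assumptions chosen for illustration, and the real model operates on batched tensors.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                summed[i] += value
    count = max(count, 1)  # avoid division by zero on an all-padding input
    return [s / count for s in summed]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

# Toy input (hypothetical): three token vectors of dimension 2, the last is padding.
tokens = [[3.0, 0.0], [0.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vector = l2_normalize(mean_pool(tokens, mask))
```

Because the output has unit length, the dot product of two such vectors equals their cosine similarity.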

Architecture and Training

The model is based on XLM-RoBERTa with a maximum sequence length of 512 tokens. Training proceeds in three stages:

Stage 1 – NLI training: Multiple Negatives Ranking Loss on combined English (SNLI) and French (XNLI) datasets to learn sentence semantics.
Stage 2 – STS fine-tuning: Siamese BERT networks with cosine-similarity loss on English and French STS Benchmark data.
Stage 3 – Augmented SBERT: Data augmentation using cross-encoder labels (see Augmented SBERT), improving bi-encoder performance on pairwise scoring tasks.
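
As a rough illustration of the Stage 1 objective, the sketch below computes an in-batch ranking loss in which each anchor's paired sentence is the positive and every other sentence in the batch acts as a negative. It is an assumption-laden sketch, not the authors' training code: the function names are hypothetical, the vectors are toy values, and the scale of 20.0 is an assumed value that the source does not specify.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch (hypothetical): correctly paired vs. deliberately mismatched pairs.
aligned = multiple_negatives_ranking_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = multiple_negatives_ranking_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

The loss is near zero when each anchor is closest to its own positive and grows when another sentence in the batch scores higher.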

Benchmark Performance (MTEB French tasks)

Task	Metric	Score
AlloProfClusteringP2P	v_measure	64.71
AlloProfClusteringS2S	v_measure	45.57
AlloprofReranking	MAP	70.46
AlloprofReranking	MRR	71.61
AlloprofRetrieval	MAP@10	35.52
AlloprofRetrieval	MRR@10	35.52

These scores all come from French-language MTEB tasks, so they characterize the model's French clustering, reranking, and retrieval performance. English and cross-lingual performance is not measured by this table.
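
For readers unfamiliar with the ranking metrics in the table, here is a minimal sketch of how MRR@10 and MAP@10 are commonly computed. The helper names, document IDs, and relevance sets are hypothetical, and the normalization used for average precision is one common convention; the benchmark's exact definition may differ.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1 / rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """Mean of precision values at each relevant hit within the top k."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    denom = min(len(relevant_ids), k)
    return precision_sum / denom if denom else 0.0

# Two toy queries (hypothetical): ranked result IDs and the set of relevant IDs.
runs = [
    (["d3", "d1", "d7"], {"d1"}),
    (["d2", "d9", "d4"], {"d2", "d4"}),
]
mrr = sum(reciprocal_rank_at_k(r, rel) for r, rel in runs) / len(runs)
map_at_10 = sum(average_precision_at_k(r, rel) for r, rel in runs) / len(runs)
```

When every query has exactly one relevant document, the two metrics coincide by construction. Whether that explains the identical AlloprofRetrieval figures above is not stated in the source.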

best for

·Cross-lingual semantic search between French and English
·Text clustering of bilingual document collections
·Semantic textual similarity for French-English sentence pairs
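
The search use case can be sketched as a dot-product ranking over L2-normalized vectors. The 3-dimensional vectors below are made-up stand-ins rather than real model output, and `top_k` is a hypothetical helper; in practice the vectors would come from the embedding model.

```python
def dot(u, v):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(a * b for a, b in zip(u, v))

def top_k(query_vec, corpus, k=2):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(dot(query_vec, vec), text) for text, vec in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy corpus (hypothetical unit vectors standing in for real embeddings).
corpus = [
    ("Le chat dort sur le canapé.", [0.6, 0.8, 0.0]),
    ("The stock market fell today.", [0.0, 0.0, 1.0]),
    ("A cat is sleeping on the sofa.", [0.8, 0.6, 0.0]),
]
query = [0.8, 0.6, 0.0]
results = top_k(query, corpus, k=2)
```

In this toy setup the French and English cat sentences rank above the unrelated one, which is the behavior a shared bilingual vector space is meant to provide.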

FAQ

What is the embedding dimension of this model?

The model outputs 1024-dimensional vectors after SentenceTransformer pooling.

What languages does this model support?

It supports French and English bilingual text.

What is the maximum input length?

The model accepts up to 512 tokens per input.
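
One way to handle longer inputs is to split the token sequence into windows that fit the limit and embed each window separately. The sketch below assumes the text has already been tokenized into IDs and that two positions are reserved for special tokens; both the helper name and the reserved count are assumptions, not documented behavior of this model.

```python
def chunk_token_ids(token_ids, max_len=512, reserved=2):
    """Split token IDs into consecutive windows of at most max_len - reserved items."""
    window = max_len - reserved
    if window <= 0:
        raise ValueError("max_len must exceed reserved")
    return [token_ids[i:i + window] for i in range(0, len(token_ids), window)]

# Toy input (hypothetical): 1200 placeholder token IDs.
chunks = chunk_token_ids(list(range(1200)))
chunk_lengths = [len(c) for c in chunks]
```

How the per-window vectors are combined afterwards (for example by averaging) is a design choice the source does not address.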

How do I call this model via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key, specifying the model name Lajavaness/bilingual-embedding-base.
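
A minimal sketch of what such a request could look like, assuming the endpoint follows the usual OpenAI embeddings shape (a POST to `/embeddings` with `model` and `input` fields and a bearer token). The base URL and API key below are placeholders, not real gigarouter values, and the request is only constructed here, not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.example.invalid/v1"  # hypothetical placeholder, not a real endpoint
API_KEY = "YOUR_API_KEY"                     # placeholder credential

def build_embeddings_request(texts, model="Lajavaness/bilingual-embedding-base"):
    """Build (but do not send) an OpenAI-style embeddings request."""
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        url=f"{BASE_URL}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_embeddings_request(["Bonjour le monde", "Hello world"])
# Sending is left out on purpose; urllib.request.urlopen(request) would perform the call.
```

Check the provider's documentation for the actual base URL and response format before relying on this shape.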

What license is this model released under?

The model is released under the Apache 2.0 license (apache-2.0).

not yet live

We're benchmarking and onboarding Bilingual Embedding Base as a hosted, OpenAI-compatible API.

related embedding models

nomic-embed-text-v1.5