Embedder 100P
deepfile/embedder-100p
published Jul 2023 · updated Dec 2024
Embedder 100P is a bilingual English-German text embedding model that maps sentences and paragraphs to 768-dimensional dense vectors for semantic search and clustering.
specs
| Task | Text Embedding |
| Architecture | XLM-RoBERTa with mean pooling |
| Output Dimensions | 768 |
| Max Sequence Length | 384 tokens |
| Training Data | >20 GiB of German text with knowledge distillation for bilingual (English/German) capability |
about this model
embedder-100p is a bilingual (English and German) embedding model that maps sentences and paragraphs to a 768-dimensional dense vector space for tasks such as clustering and semantic search. It is a bi-encoder based on the ms-marco sentence-transformers architecture, trained on over 20 GiB of German text and refined through knowledge distillation to support both English and German inputs.
Architecture
The model uses an XLM-RoBERTa transformer with a maximum sequence length of 384 tokens. It applies mean pooling over the token embeddings to produce the final 768-dimensional output vector. The two modules are:
- Transformer: XLMRobertaModel (max_seq_length: 384, do_lower_case: False)
- Pooling: mean pooling (word_embedding_dimension: 768)
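As a rough illustration of the pooling step listed above, here is a minimal pure-Python sketch of masked mean pooling. It assumes the usual convention that padded positions are excluded through an attention mask; the function name `mean_pool` and the toy 3-dimensional vectors are hypothetical and stand in for the model's 768-dimensional token embeddings.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average token vectors, skipping padded positions (mask value 0)."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                sums[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [s / count for s in sums]


# Toy input: three tokens with 3 dimensions each; the last token is padding.
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0, 4.0]
```

The output has the same width as each token vector, which is why a 768-wide transformer yields a 768-dimensional sentence embedding regardless of input length.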
Training Details
The model was trained for 20 epochs using MSELoss with a batch size of 16, a learning rate of 7e-6, and a warmup of 5,000 steps. The training DataLoader yielded 231,230 batches (steps) per epoch, which works out to roughly 4.6 million steps over the full run.
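To put these hyperparameters in perspective, the sketch below works through the step arithmetic and shows one plausible learning-rate schedule. Two things here are assumptions rather than facts from the model card: that the warmup is linear and followed by a linear decay to zero, and that MSELoss compares a student embedding against a teacher embedding during distillation. The helper names are hypothetical.

```python
EPOCHS = 20
STEPS_PER_EPOCH = 231_230
BATCH_SIZE = 16
PEAK_LR = 7e-6
WARMUP_STEPS = 5_000

total_steps = EPOCHS * STEPS_PER_EPOCH             # 4,624,600
examples_per_epoch = STEPS_PER_EPOCH * BATCH_SIZE  # upper bound; the last batch may be partial


def learning_rate(step: int) -> float:
    """Assumed schedule: linear warmup to PEAK_LR, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(total_steps - step, 0)
    return PEAK_LR * remaining / (total_steps - WARMUP_STEPS)


def mse_loss(student: list, teacher: list) -> float:
    """Mean squared error between two equal-length embedding vectors."""
    if len(student) != len(teacher):
        raise ValueError("vectors must have the same length")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


print(total_steps)
print(examples_per_epoch)
print(learning_rate(2_500))
print(mse_loss([1.0, 2.0], [1.0, 4.0]))
```

Under these assumptions the warmup covers only about 0.1% of the run, so nearly all training happens on the decaying part of the schedule.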
Evaluation
The model was evaluated on the MTEB benchmark. Specific scores are not provided in the model card.
Key Strengths
- Bilingual support for English and German text
- 768-dimensional embeddings suitable for semantic search and clustering
- Trained on over 20 GiB of German text with knowledge distillation
best for
- Bilingual English-German semantic search
- Clustering multilingual text documents
- Sentence similarity and retrieval in mixed-language corpora
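For the search and similarity use cases above, embeddings are typically compared with cosine similarity. The following is a minimal sketch of that ranking step, assuming the vectors have already been produced by the model; the 3-dimensional toy vectors, document IDs, and function names are hypothetical placeholders for real 768-dimensional embeddings.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec: List[float], corpus: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for embeddings of an English, a German, and an unrelated document.
corpus = {
    "doc_en": [0.9, 0.1, 0.0],
    "doc_de": [0.8, 0.2, 0.1],
    "doc_other": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]

results = rank(query, corpus)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

Because a bilingual model places English and German text in the same vector space, the same ranking code applies to mixed-language corpora without a translation step.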
FAQ
Embedder 100P is best for semantic textual similarity, clustering, and retrieval tasks involving English and German text.
It accepts sentences or paragraphs as plain text; internally it tokenizes with a maximum of 384 tokens per input.
To call it, send the input text to the OpenAI-compatible embeddings endpoint, authenticated with your gigarouter API key.
The model card does not specify a license; please check the Hugging Face repository for any licensing details.
The model card does not report parameter count; the base architecture is XLM-RoBERTa.
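The API answer above mentions an OpenAI-compatible embeddings endpoint. As a hedged sketch of what such a request could look like, the code below builds (but does not send) a POST request in the standard OpenAI embeddings format. The base URL is a placeholder, and using the slug `deepfile/embedder-100p` as the `model` value is an assumption; check the provider's documentation for the actual values.

```python
import json
import urllib.request
from typing import Dict, List

BASE_URL = "https://api.example.invalid/v1"  # hypothetical placeholder, not a real endpoint
MODEL_ID = "deepfile/embedder-100p"          # assumed to match the page slug


def build_embedding_request(texts: List[str], api_key: str) -> urllib.request.Request:
    """Build an OpenAI-style embeddings request without sending it."""
    payload = json.dumps({"model": MODEL_ID, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/embeddings",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_embeddings(response_body: Dict) -> List[List[float]]:
    """Extract vectors from an OpenAI-style response, ordered by input index."""
    items = sorted(response_body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


request = build_embedding_request(["Hallo Welt", "Hello world"], api_key="YOUR_API_KEY")
print(request.get_method(), request.full_url)
# To send it once a real endpoint is available:
# with urllib.request.urlopen(request) as response:
#     vectors = parse_embeddings(json.load(response))
```

Inputs longer than 384 tokens would presumably be truncated by the model, so splitting long documents into paragraphs before embedding is a reasonable precaution.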
Embedder 100P is currently being benchmarked and onboarded as a hosted, OpenAI-compatible API.