Embedder 100P

deepfile/embedder-100p

published Jul 2023 · updated Dec 2024

Embedder 100P is a bilingual English-German text embedding model that maps sentences and paragraphs to 768-dimensional dense vectors for semantic search and clustering.

est. price: ~$0.008 / 1M tokens (estimated, set at launch)

downloads / mo: 223

specs

Task	Text Embedding
Architecture	XLM-RoBERTa with mean pooling
Output Dimensions	768
Max Sequence Length	384 tokens
Training Data	>20 GiB of German text with knowledge distillation for bilingual (English/German) capability

about this model

embedder-100p is a bilingual (English and German) embedding model that maps sentences and paragraphs to a 768-dimensional dense vector space for tasks such as clustering and semantic search. It is a bi-encoder based on the ms-marco sentence-transformers architecture, trained on over 20 GiB of German text and refined through knowledge distillation to support both English and German inputs.

Architecture

The model uses an XLM-RoBERTa transformer (XLMRobertaModel) with a maximum sequence length of 384 tokens. It applies mean pooling over the token embeddings to produce the final 768-dimensional output vector. The two modules are:

Transformer: XLMRobertaModel (max_seq_length: 384, do_lower_case: False)
Pooling: mean pooling (word_embedding_dimension: 768)
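
As a rough illustration of the pooling step, the sketch below averages token vectors while ignoring padding positions. It uses NumPy with random stand-in data rather than the real transformer output; the function name `mean_pool` and the toy shapes are assumptions made for this example, not part of the model card.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over the sequence axis, skipping padding.

    token_embeddings: (batch, seq_len, dim) float array
    attention_mask:   (batch, seq_len) array with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                    # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                    # guard against empty rows
    return summed / counts

# Toy input: 2 sequences, 4 token positions, 768 dimensions (random stand-in values)
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 4, 768))
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])
pooled = mean_pool(tokens, mask)  # one 768-dimensional vector per sequence
```

Masking matters because padded positions would otherwise pull the average toward whatever values the padding tokens happen to carry.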

Training Details

The model was trained for 20 epochs using MSELoss with a batch size of 16, a learning rate of 7e-6, and a warmup of 5,000 steps. The training DataLoader yielded 231,230 batches (steps) per epoch.
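
The figures above imply a few derived quantities, sketched below. The constants come from the model card; the linear warmup shape and the form of the distillation target (student embeddings regressed onto teacher embeddings with mean squared error) are assumptions for illustration, and the function names are hypothetical.

```python
import numpy as np

# Values reported in the model card
EPOCHS = 20
STEPS_PER_EPOCH = 231_230
BATCH_SIZE = 16
WARMUP_STEPS = 5_000
LEARNING_RATE = 7e-6

# Derived quantities (simple arithmetic, not stated in the model card)
total_steps = EPOCHS * STEPS_PER_EPOCH                   # 4,624,600 optimizer steps
max_examples_per_epoch = STEPS_PER_EPOCH * BATCH_SIZE    # upper bound; the last batch may be smaller

def warmup_lr(step: int) -> float:
    """Learning rate during warmup, ASSUMING a linear ramp; any later decay is omitted."""
    return LEARNING_RATE * min(1.0, step / WARMUP_STEPS)

def mse_distillation_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Mean squared error between student and teacher embeddings (assumed distillation target)."""
    return float(np.mean((student - teacher) ** 2))

# Toy check with constant stand-in embeddings
loss = mse_distillation_loss(np.zeros((2, 768)), np.ones((2, 768)))
```

Under these assumptions the warmup covers roughly 0.1% of all optimizer steps, so almost the entire run uses the schedule that follows it.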

Evaluation

The model was evaluated on the MTEB benchmark. Specific scores are not provided in the model card.

Key Strengths

Bilingual support for English and German text
768-dimensional embeddings suitable for semantic search and clustering
Trained on over 20 GiB of German text with knowledge distillation

best for

·Bilingual English-German semantic search
·Clustering English and German text documents
·Sentence similarity and retrieval in mixed-language corpora
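
For the search use case, a minimal ranking sketch is shown below. It assumes the 768-dimensional vectors have already been computed and uses random stand-in vectors in their place; `cosine_top_k` is a hypothetical helper name, and cosine similarity is a common choice for this kind of model rather than something the model card prescribes.

```python
import numpy as np

def cosine_top_k(query: np.ndarray, corpus: np.ndarray, k: int = 3):
    """Return (index, score) pairs for the k corpus vectors most similar to the query."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity per corpus row
    order = np.argsort(-scores)[:k]     # highest scores first
    return [(int(i), float(scores[i])) for i in order]

# Stand-in vectors: 5 "documents" and a query that is a slightly noisy copy of document 2
rng = np.random.default_rng(42)
corpus = rng.normal(size=(5, 768))
query = corpus[2] + 0.01 * rng.normal(size=768)
hits = cosine_top_k(query, corpus, k=2)
```

Because English and German inputs share one vector space, the same ranking applies whether the query and the documents are in the same language or not.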

FAQ

What is this model best used for?

It is best for semantic textual similarity, clustering, and retrieval tasks involving English and German text.

What input format does it accept?

It accepts sentences or paragraphs as plain text. Inputs are tokenized internally, with a maximum of 384 tokens per input.
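
For texts longer than the 384-token limit, one option is to split the token sequence into overlapping windows and embed each window separately. The sketch below works on a plain list of token IDs and does not call the real tokenizer; the two reserved positions for special tokens and the overlap of 32 are assumptions, and `window_token_ids` is a hypothetical helper.

```python
from typing import List

def window_token_ids(
    token_ids: List[int],
    max_len: int = 384,   # limit from the model card
    reserved: int = 2,    # ASSUMPTION: room for start/end special tokens
    overlap: int = 32,    # ASSUMPTION: tokens shared between neighbouring windows
) -> List[List[int]]:
    """Split token IDs into overlapping windows that fit within max_len."""
    body = max_len - reserved
    if body <= overlap:
        raise ValueError("overlap must be smaller than the usable window size")
    step = body - overlap
    windows = []
    for start in range(0, max(len(token_ids), 1), step):
        windows.append(token_ids[start:start + body])
        if start + body >= len(token_ids):
            break  # this window already reaches the end of the input
    return windows

# Stand-in for a tokenized document of 1,000 tokens
windows = window_token_ids(list(range(1000)))
```

The per-window vectors can then be averaged or searched individually, depending on whether one vector per document or passage-level retrieval is wanted.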

How do I call this model via the gigarouter API?

Send a request containing the input text to the OpenAI-compatible embeddings endpoint, authenticating with your gigarouter API key.
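
A request in the OpenAI embeddings format could be assembled roughly as follows. The base URL `https://api.gigarouter.example/v1` is a placeholder, and using the page slug `deepfile/embedder-100p` as the model ID is an assumption; check the gigarouter documentation for the real values. The sketch only builds the request and does not send it.

```python
import json
import urllib.request
from typing import List

def build_embeddings_request(
    base_url: str,
    api_key: str,
    texts: List[str],
    model: str = "deepfile/embedder-100p",  # ASSUMPTION: model ID taken from the page slug
) -> urllib.request.Request:
    """Build a POST request in the OpenAI-compatible embeddings format."""
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_embeddings_request(
    "https://api.gigarouter.example/v1",  # hypothetical base URL
    "YOUR_API_KEY",
    ["Guten Morgen", "Good morning"],
)
# To send it once the model is available: urllib.request.urlopen(req)
```

In the OpenAI response format, each input text corresponds to one entry in the `data` array, with the vector under `embedding`.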

Is the model license open-source?

The model card does not specify a license; please check the Hugging Face repository for any licensing details.

How many parameters does it have?

The model card does not report a parameter count; the base architecture is XLM-RoBERTa.

not yet live

We're benchmarking and onboarding Embedder 100P as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
