MMLW E5 Large

sdadas/mmlw-e5-large

published Nov 2023 · updated Feb 2026

MMLW E5 Large is a Polish text embedding model that transforms texts into 1024-dimensional vectors for tasks like semantic similarity, clustering, and information retrieval.

est. price

~$0.008

/ 1M tokens · estimated, set at launch

API providers

downloads / mo

5.2K

license

apache-2.0

specs

Task	Text Embedding
Architecture	Initialized from multilingual E5; distilled with English BGE teacher models
Parameters	335M
License	Apache 2.0

about this model

sdadas/mmlw-e5-large is an embedding model that transforms Polish text into 1024-dimensional vectors for tasks such as semantic similarity, clustering, and information retrieval.

The model was trained with multilingual knowledge distillation (Reimers & Gurevych, EMNLP 2020): it was initialized from a multilingual E5 checkpoint and then distilled on 60 million Polish-English text pairs, with English FlagEmbeddings (BGE) models as teachers. Training ran on A100 GPUs provided by the TASK center at Gdańsk University of Technology.
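The distillation objective described by Reimers & Gurevych can be illustrated with a toy example. This is a minimal sketch, not the training code for this model: the function names and the 4-dimensional vectors are made up for illustration, and real training would use batches of model outputs rather than hand-written lists.

```python
from typing import Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(teacher_en: Vector, student_en: Vector, student_pl: Vector) -> float:
    """Loss for one (English, Polish) sentence pair.

    The student's embeddings of both the English sentence and its Polish
    translation are pulled toward the teacher's embedding of the English sentence.
    """
    return mse(teacher_en, student_en) + mse(teacher_en, student_pl)


# Toy 4-dimensional vectors standing in for real embeddings (hypothetical values).
teacher_en = [0.5, -0.25, 0.0, 1.0]
student_en = [0.5, -0.25, 0.0, 1.0]  # English side already matches the teacher
student_pl = [0.0, -0.25, 0.0, 0.0]  # Polish side is still far from the teacher

loss = distillation_loss(teacher_en, student_en, student_pl)
print(loss)
```

Minimizing this quantity over many translation pairs is what lets a student model place Polish text in the same vector space the English teacher uses.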

When encoding, prefix each query with "query: " and each passage with "passage: " (note the trailing space). The model was trained with this convention, so the prefixes are required for correct retrieval performance.
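The prefix convention can be sketched in a few lines of Python. The helper names below (`prepare_queries`, `prepare_passages`, `rank_passages`, `load_mmlw_encoder`) are hypothetical and chosen for illustration. The loader assumes the `sentence-transformers` package is installed and that the weights can be downloaded; it is defined but not called here.

```python
import math
from typing import List, Sequence

QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def prepare_queries(queries: Sequence[str]) -> List[str]:
    """Add the query prefix the model expects."""
    return [QUERY_PREFIX + q for q in queries]


def prepare_passages(passages: Sequence[str]) -> List[str]:
    """Add the passage prefix the model expects."""
    return [PASSAGE_PREFIX + p for p in passages]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_passages(query_vec: Sequence[float], passage_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return passage indices ordered from most to least similar to the query."""
    scores = [cosine(query_vec, p) for p in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def load_mmlw_encoder():
    """Return a function mapping a list of texts to a list of vectors.

    Assumption: sentence-transformers is installed; the first call downloads the model.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sdadas/mmlw-e5-large")
    return lambda texts: model.encode(texts, show_progress_bar=False).tolist()


# Usage (not executed here because it downloads the model):
#   encode = load_mmlw_encoder()
#   query_vec = encode(prepare_queries(["Jak dożyć 100 lat?"]))[0]
#   passage_vecs = encode(prepare_passages(["Trzeba zdrowo się odżywiać i uprawiać sport."]))
#   order = rank_passages(query_vec, passage_vecs)
```

Keeping the prefixes in dedicated helpers makes it harder to encode a query as a passage by mistake, which would silently hurt ranking quality.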

Evaluation Results

On the Polish Massive Text Embedding Benchmark (MTEB), the model achieves an average score of 61.17. On the Polish Information Retrieval Benchmark (PIRB), the NDCG@10 is 56.09. Per-task MTEB results include:

Task	Metric	Score
8TagsClustering	v_measure	30.62
AllegroReviews (Classification)	accuracy	37.68
AllegroReviews (Classification)	f1	34.19
ArguAna-PL (Retrieval)	NDCG@10	63.25
CBD (Classification)	accuracy	66.15
CDSC-E (PairClassification)	cos_sim_accuracy
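
For readers unfamiliar with the NDCG@10 metric reported for the retrieval tasks, the sketch below shows one common way to compute it. It assumes the linear-gain variant of DCG and assumes the relevance list covers every judged candidate for the query; the benchmark's own evaluation code may differ in these details. The function returns a value between 0 and 1, whereas the scores above are on a 0 to 100 scale.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the ideal (descending relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of each returned passage, in the order the system ranked them.
perfect = ndcg_at_k([1, 0, 0])     # the only relevant passage is ranked first
second_place = ndcg_at_k([0, 1])   # the relevant passage is ranked second
```

A relevant passage that slips from rank 1 to rank 2 is discounted by a factor of 1 / log2(3), which is why ranking order matters as much as recall for this metric.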

best for

·Polish semantic search and information retrieval
·Polish text clustering and classification
·Polish question answering and passage ranking

FAQ

What is the input format for this model?

Queries must be prefixed with "query: " and passages with "passage: " before encoding.

What embedding dimension does the model output?

It outputs 1024-dimensional vectors.

How does this model perform on Polish benchmarks?

It achieves an average score of 61.17 on the Polish MTEB and an NDCG@10 of 56.09 on the Polish Information Retrieval Benchmark.

What license is the model released under?

The model is released under the Apache 2.0 license.

How can I call this model via the gigarouter API?

Once the hosted endpoint is live, call the gigarouter OpenAI-compatible embeddings endpoint with your API key, prefixing queries with "query: " and passages with "passage: ".
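
As a rough illustration, the sketch below builds an OpenAI-style embeddings request without sending it. The base URL is a placeholder, and the model identifier and the `build_embeddings_request` helper are assumptions; none of them is a confirmed gigarouter value. The request shape (a JSON body with `model` and `input`, plus a Bearer token) follows the usual OpenAI-compatible convention.

```python
import json
import urllib.request
from typing import Sequence

BASE_URL = "https://api.example.invalid/v1"  # placeholder, not a real gigarouter URL
MODEL_ID = "sdadas/mmlw-e5-large"            # assumed identifier on the hosted API


def build_embeddings_request(
    texts: Sequence[str], api_key: str, *, is_query: bool
) -> urllib.request.Request:
    """Build (but do not send) a POST request to an OpenAI-style /embeddings route."""
    prefix = "query: " if is_query else "passage: "
    payload = {"model": MODEL_ID, "input": [prefix + t for t in texts]}
    return urllib.request.Request(
        f"{BASE_URL}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_embeddings_request(["Jak dożyć 100 lat?"], "YOUR_API_KEY", is_query=True)
# To send it against a real endpoint: urllib.request.urlopen(request)
```

In OpenAI-compatible APIs the response typically carries one vector per input under `data[i].embedding`; whether the hosted endpoint matches that exactly should be checked against its documentation once available.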

not yet live

We're benchmarking and onboarding MMLW E5 Large as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
