MMLW E5 Large
sdadas/mmlw-e5-large
published Nov 2023 · updated Feb 2026
MMLW E5 Large is a Polish text embedding model that transforms texts into 1024-dimensional vectors for tasks like semantic similarity, clustering, and information retrieval.
specs
| Field | Value |
|---|---|
| Task | Text Embedding |
| Architecture | Distilled from multilingual E5, using BGE teacher models |
| Parameters | 335M |
| License | MIT |
about this model
sdadas/mmlw-e5-large is an embedding model that transforms Polish text into 1024-dimensional vectors for tasks such as semantic similarity, clustering, and information retrieval.
Trained via multilingual knowledge distillation (Reimers & Gurevych, EMNLP 2020), the model was initialized from a multilingual E5 checkpoint and distilled on 60 million Polish-English text pairs using English FlagEmbeddings (BGE) as teacher models. The training infrastructure was provided by the Gdańsk University of Technology TASK center with A100 GPUs.
When encoding, queries must be prefixed with "query: " and passages with "passage: " (each prefix ends with a single space). This convention matches how the model was trained and is required for correct retrieval performance.
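The prefix convention can be sketched as follows. This is a minimal illustration, assuming the checkpoint is loaded through the `sentence-transformers` library, which the text above does not name. The helper names (`as_query`, `as_passage`, `rank_passages`, `embed`, `demo`) and the Polish example texts are hypothetical and chosen only for illustration.

```python
import math

QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def as_query(text: str) -> str:
    """Prepend the prefix expected for search queries."""
    return QUERY_PREFIX + text


def as_passage(text: str) -> str:
    """Prepend the prefix expected for documents / passages."""
    return PASSAGE_PREFIX + text


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return passage indices ordered from most to least similar to the query."""
    scores = [cosine(query_vec, p) for p in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def embed(texts):
    """Encode already-prefixed texts into lists of floats.

    Assumption: sentence-transformers is installed and can load this checkpoint.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sdadas/mmlw-e5-large")
    return model.encode(texts).tolist()


def demo():
    """Rank two illustrative passages against one query (downloads the model)."""
    query = as_query("Jak dożyć 100 lat?")
    passages = [
        as_passage("Trzeba zdrowo się odżywiać i uprawiać sport."),
        as_passage("Warszawa jest stolicą Polski."),
    ]
    vectors = embed([query] + passages)
    for index in rank_passages(vectors[0], vectors[1:]):
        print(passages[index])
```

Calling `demo()` downloads the checkpoint on first use, so it is left uninvoked here. The ranking helper is kept separate from the model call so the prefix and scoring logic can be checked without the model.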
Evaluation Results
On the Polish Massive Text Embedding Benchmark (MTEB), the model achieves an average score of 61.17. On the Polish Information Retrieval Benchmark (PIRB), the NDCG@10 is 56.09. Per-task MTEB results include:
| Task | Metric | Score |
|---|---|---|
| 8TagsClustering | v_measure | 30.62 |
| AllegroReviews (Classification) | accuracy | 37.68 |
| AllegroReviews (Classification) | f1 | 34.19 |
| ArguAna-PL (Retrieval) | NDCG@10 | 63.25 |
| CBD (Classification) | accuracy | 66.15 |
| CDSC-E (PairClassification) | cos_sim_accuracy |
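
For readers unfamiliar with the retrieval metric reported above, NDCG@10 can be sketched as below. This is a minimal illustration of the common linear-gain formulation, not the benchmarks' own evaluation code. The function names and the toy relevance labels are assumptions, and the ideal ordering is computed only from the labels passed in, so the list is assumed to cover every judged document for the query.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items of a ranked list."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so position 1 divides by log2(2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents at all
    return dcg_at_k(ranked_relevances, k) / ideal


# Toy example: the only relevant document was returned in second place.
score = ndcg_at_k([0, 1, 0])
```

The result lies between 0 and 1. Benchmark tables such as the one above appear to report it on a 0 to 100 scale, averaged over queries.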
best for
- Polish semantic search and information retrieval
- Polish text clustering and classification
- Polish question answering and passage ranking
FAQ
Queries must be prefixed with "query: " and passages with "passage: " before encoding.
The model outputs 1024-dimensional vectors.
The model achieves an average score of 61.17 on the Polish MTEB and an NDCG@10 of 56.09 on the Polish Information Retrieval Benchmark (PIRB).
The model is released under the MIT license.
Use the gigarouter OpenAI-compatible endpoint with your API key, sending queries with the "query: " prefix and passages with the "passage: " prefix.
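A request to an OpenAI-compatible embeddings endpoint might look like the sketch below. The base URL `https://api.example.com/v1`, the API key value, and the idea that the model is addressed by the id `sdadas/mmlw-e5-large` are placeholders not given in the source. Only the `/embeddings` route, the `model` and `input` request fields, and the `data[i].embedding` response shape follow the OpenAI embeddings API.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com/v1"  # hypothetical placeholder, not the real endpoint
MODEL_ID = "sdadas/mmlw-e5-large"        # assumption: the hosted model uses this id


def build_embedding_request(texts, api_key, base_url=BASE_URL):
    """Build (but do not send) a POST request for already-prefixed texts."""
    payload = {"model": MODEL_ID, "input": texts}
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_embeddings(texts, api_key, base_url=BASE_URL):
    """Send the request and return one embedding per input, in input order."""
    request = build_embedding_request(texts, api_key, base_url)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-style responses carry one item per input under "data", tagged with "index".
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

The caller adds the prefixes before sending, for example `fetch_embeddings(["query: Jak dożyć 100 lat?", "passage: Trzeba zdrowo się odżywiać."], api_key)`, since the endpoint is assumed to pass text to the model unchanged.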