MMLW E5 Base
sdadas/mmlw-e5-base
published Nov 2023 · updated Feb 2026
MMLW E5 Base is a Polish text embedding model that transforms texts into 768-dimensional vectors for tasks like semantic similarity, clustering, and information retrieval.
specs
| Task | Text Embedding |
| Architecture | Initialized from multilingual E5; distilled with BAAI/bge-base-en as teacher |
| Parameters | Not specified |
| License | MIT (teacher model) |
about this model
MMLW-e5-base is a Polish neural text encoder that transforms texts into 768-dimensional embeddings for tasks such as semantic similarity, clustering, and information retrieval. It is a distilled model initialized from the multilingual E5 checkpoint and further trained using multilingual knowledge distillation on 60 million Polish-English text pairs, with BAAI/bge-base-en as the teacher model. The distillation method follows the approach described in Reimers & Gurevych (EMNLP 2020).
The model requires specific prefixes: queries must be prefixed with "query: " and passages with "passage: ". It can also serve as a base for further fine-tuning.
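The prefix requirement can be sketched as follows. This is a minimal illustration, not the authors' reference code: the helper names (`with_prefix`, `cosine_similarity`, `rank_passages`) are hypothetical, and `model` is assumed to be any object with an `encode(list_of_texts)` method that returns one vector per text, as `sentence_transformers.SentenceTransformer` does.

```python
import math

# Prefixes required by the model, as stated above.
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def with_prefix(texts, prefix):
    """Prepend the prefix to each text, leaving already-prefixed texts unchanged."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_passages(model, query, passages):
    """Return (score, passage) pairs sorted from most to least similar.

    Assumption: `model.encode` accepts a list of strings and returns
    one embedding per string, in the same order.
    """
    query_vec = model.encode(with_prefix([query], QUERY_PREFIX))[0]
    passage_vecs = model.encode(with_prefix(passages, PASSAGE_PREFIX))
    scored = [
        (cosine_similarity(query_vec, vec), passage)
        for vec, passage in zip(passage_vecs, passages)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Assuming the `sentence-transformers` package is installed and the weights can be downloaded, `model` could be created with `SentenceTransformer("sdadas/mmlw-e5-base")` and passed to `rank_passages` together with a Polish query and a list of candidate passages.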
Benchmark Results
- Polish MTEB – average score of 59.71 (see MTEB Leaderboard).
- Polish Information Retrieval Benchmark (PIRB) – NDCG@10 of 53.56 (see PIRB Leaderboard).
The model was trained with A100 GPU cluster support from the TASK center at Gdansk University of Technology.
best for
- Polish semantic similarity and clustering
- Polish information retrieval and search
- Polish text classification fine-tuning
FAQ
It outputs 768-dimensional vectors.
Queries must be prefixed with "query: " and passages with "passage: ".
It achieves an average score of 59.71 on the Polish MTEB.
The teacher model BAAI/bge-base-en is released under the MIT license.
Use the gigarouter OpenAI-compatible endpoint with an API key, following the required query/passage prefixes.
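A request to an OpenAI-compatible embeddings endpoint might be assembled as in the sketch below. Several details are assumptions rather than documented values: the base URL (`https://example.invalid/v1` is a placeholder), the environment variable names, the model identifier accepted by the endpoint, and the helper names. Only the request and response shapes follow the general OpenAI embeddings format.

```python
import json
import os
import urllib.request

# Hypothetical placeholders: the real base URL and key come from the provider.
BASE_URL = os.environ.get("GIGAROUTER_BASE_URL", "https://example.invalid/v1")
API_KEY = os.environ.get("GIGAROUTER_API_KEY", "")


def build_embeddings_request(texts, prefix, base_url=BASE_URL, api_key=API_KEY):
    """Build (but do not send) a POST request for an OpenAI-style /embeddings route."""
    payload = {
        # Assumed model identifier; the hosted name may differ.
        "model": "sdadas/mmlw-e5-base",
        # Apply the required "query: " or "passage: " prefix to every input.
        "input": [prefix + text for text in texts],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_embeddings_response(body):
    """Extract vectors, ordered by index, from an OpenAI-style response body."""
    data = json.loads(body)["data"]
    ordered = sorted(data, key=lambda item: item["index"])
    return [item["embedding"] for item in ordered]
```

Once the endpoint is available, the request could be sent with `urllib.request.urlopen(request)` and the response body passed to `parse_embeddings_response`; each returned vector should have 768 dimensions.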
We're benchmarking and onboarding MMLW E5 Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.