Qwen3 Embedding 4B
boboliu/Qwen3-Embedding-4B-W4A16-G128
published Jun 2025 · updated Jun 2025
Qwen3 Embedding 4B is a GPTQ-quantized text embedding model that generates dense vector representations for multilingual text, supporting retrieval, classification, clustering, and ranking.
specs
| Field | Value |
|---|---|
| Task | Text Embedding |
| Architecture | Qwen3 Transformer |
| Parameters | 4B (W4A16 quantized) |
| License | Apache 2.0 |
about this model
boboliu/Qwen3-Embedding-4B-W4A16-G128 is a GPTQ 4-bit quantized variant of Qwen3-Embedding-4B (Apache 2.0). It generates dense vector representations for retrieval, classification, clustering, reranking, and bitext mining. It supports a 32,768-token context length, flexible embedding dimensions via Matryoshka Representation Learning (MRL), and user-defined instructions for task-specific optimization; using custom instructions typically yields a 1–5% improvement.
The quantized model reduces GPU VRAM usage from 17,430 MB to 11,000 MB (without Flash Attention 2) while incurring a mean performance loss of approximately 0.72% on the C-MTEB benchmark.
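The savings and the quoted loss can be checked with a few lines of arithmetic. The sketch below uses only the VRAM figures above and the C-MTEB mean scores listed for the original and quantized models (72.27 and 71.75); it suggests that the 0.72% figure is a relative drop, not a drop in absolute points.

```python
# Figures quoted in this model card.
VRAM_ORIGINAL_MB = 17_430
VRAM_QUANTIZED_MB = 11_000
MEAN_ORIGINAL = 72.27   # C-MTEB mean, Qwen3-Embedding-4B
MEAN_QUANTIZED = 71.75  # C-MTEB mean, this model

# Memory saved by quantization, absolute and as a share of the original.
vram_saved_mb = VRAM_ORIGINAL_MB - VRAM_QUANTIZED_MB
vram_saved_pct = 100 * vram_saved_mb / VRAM_ORIGINAL_MB

# Score drop in absolute points and relative to the original score.
absolute_drop = MEAN_ORIGINAL - MEAN_QUANTIZED
relative_drop_pct = 100 * absolute_drop / MEAN_ORIGINAL

print(f"VRAM saved: {vram_saved_mb} MB ({vram_saved_pct:.1f}%)")
print(f"Mean score drop: {absolute_drop:.2f} points ({relative_drop_pct:.2f}% relative)")
```

In other words, roughly a third of the memory is saved for about half a point on the benchmark mean.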
C-MTEB Evaluation (Chinese)
| Model | Params | Mean | Class. | Clust. | Pair Class. | Rerank. | Retr. | STS |
|---|---|---|---|---|---|---|---|---|
| multilingual-e5-large-instruct | 0.6B | 58.08 | 69.80 | 48.23 | 64.52 | 57.45 | 63.65 | 45.81 |
| bge-multilingual-gemma2 | 9B | 67.64 | 75.31 | 59.30 | 86.67 | 68.28 | 73.73 | 55.19 |
| gte-Qwen2-1.5B-instruct | 1.5B | 67.12 | 72.53 | 54.61 | 79.50 | 68.21 | 71.86 | 60.05 |
| gte-Qwen2-7B-instruct | 7.6B | 71.62 | 75.77 | 66.06 | 81.16 | 69.24 | 75.70 | 65.20 |
| ritrieve_zh_v1 | 0.3B | 72.71 | 76.88 | 66.50 | 85.98 | 72.86 | 76.97 | 63.92 |
| Qwen3-Embedding-4B | 4B | 72.27 | 75.46 | 77.89 | 83.34 | 66.05 | 77.03 | 61.26 |
| This model | 4B-W4A16 | 71.75 | 75.43 | 77.51 | 83.04 | 65.73 | 76.15 | 60.47 |
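To see where quantization costs the most, the last two rows of the table can be compared category by category. This is a small sketch over the numbers printed above; the variable names are illustrative only.

```python
# Per-category scores copied from the last two rows of the table.
CATEGORIES = ["Class.", "Clust.", "Pair Class.", "Rerank.", "Retr.", "STS"]
ORIGINAL = [75.46, 77.89, 83.34, 66.05, 77.03, 61.26]   # Qwen3-Embedding-4B
QUANTIZED = [75.43, 77.51, 83.04, 65.73, 76.15, 60.47]  # this model

# Drop in absolute points for each category (original minus quantized).
drops = {
    name: round(orig - quant, 2)
    for name, orig, quant in zip(CATEGORIES, ORIGINAL, QUANTIZED)
}
largest = max(drops, key=drops.get)

for name, drop in drops.items():
    print(f"{name:<12} -{drop:.2f}")
print(f"Largest drop: {largest}")
```

On these figures, retrieval and STS lose the most, while classification is almost unchanged.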
The parent Qwen3-Embedding family includes an 8B variant that ranked No. 1 on the MTEB multilingual leaderboard (score 70.58, June 2025). This quantized 4B model maintains strong performance across more than 100 languages, with a compact memory footprint suited to cost-efficient deployment.
best for
- Multilingual text retrieval across 100+ languages
- Document and text classification
- Code retrieval and search
- Bitext mining and text clustering
FAQ
The model accepts single strings or pairs of strings (for similarity) and returns an embedding vector of dimension 2560, or of a smaller user-chosen dimension via Matryoshka Representation Learning.
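A common way to use MRL-trained embeddings at a reduced size is to keep the leading components and re-normalize, then compare two texts by cosine similarity. The sketch below illustrates that idea in plain Python on toy 4-dimensional vectors; the helper names are hypothetical, and whether a given serving stack truncates for you is an assumption to check against its documentation.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for two model outputs (real vectors would have 2560 entries).
embedding_a = [3.0, 4.0, 12.0, 0.5]
embedding_b = [6.0, 8.0, -1.0, 2.0]

short_a = truncate_and_normalize(embedding_a, 2)
short_b = truncate_and_normalize(embedding_b, 2)
similarity = cosine_similarity(short_a, short_b)
```

Smaller dimensions reduce storage and search cost, usually at some loss of accuracy, so the chosen size is worth validating on your own data.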
It uses W4A16 quantization (4-bit weights, 16-bit activations), which reduces VRAM from ~17.4 GB to ~11 GB with only ~0.72% performance loss on C-MTEB.
This model achieves a mean score of 71.75 on C-MTEB (overall tasks), compared to 72.27 for the original 4B model.
Yes, the model supports instruction-aware embeddings. Using task-specific instructions (in English) can yield 1% to 5% improvement on benchmarks.
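As an illustration of instruction-aware input, the sketch below prepends a task description to a query. The `Instruct: ...\nQuery:...` template and the practice of leaving documents unprefixed are modeled on the upstream Qwen3-Embedding usage notes and are assumptions here; verify them against the upstream model card before relying on them. The function name is hypothetical.

```python
def format_query(task_description: str, query: str) -> str:
    """Prefix a query with a one-sentence English task instruction.

    Assumed template, modeled on the upstream Qwen3-Embedding usage notes.
    """
    return f"Instruct: {task_description}\nQuery:{query}"


task = "Given a web search query, retrieve relevant passages that answer the query"
queries = [
    format_query(task, "What is the capital of China?"),
    format_query(task, "Explain gravity"),
]

# Documents are embedded as-is in this sketch; only queries carry the instruction.
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other.",
]
```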
Use the OpenAI-compatible endpoint with your API key, passing an input text and specifying the model as boboliu/Qwen3-Embedding-4B-W4A16-G128.
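A request in the OpenAI embeddings style might be assembled as follows. This is a minimal sketch using only the standard library: the base URL is a placeholder (the real endpoint is not given here), the API key is a dummy value, and the request is built but not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com/v1"  # hypothetical placeholder, not a real endpoint
MODEL_ID = "boboliu/Qwen3-Embedding-4B-W4A16-G128"


def build_embedding_request(texts, api_key, base_url=BASE_URL):
    """Build (but do not send) an OpenAI-style POST request to /embeddings."""
    payload = {"model": MODEL_ID, "input": texts}
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_embedding_request(["hello world"], api_key="YOUR_API_KEY")

# To send it against a real endpoint:
#   with urllib.request.urlopen(request) as response:
#       body = json.load(response)
# OpenAI-style responses carry the vectors under body["data"][i]["embedding"].
```

The official `openai` client can be pointed at a compatible endpoint as well; the standard-library version is used here only to keep the example dependency-free.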
We're benchmarking and onboarding Qwen3 Embedding 4B as a hosted, OpenAI-compatible API.