Qwen3 Embedding 4B

boboliu/Qwen3-Embedding-4B-W4A16-G128

published Jun 2025 · updated Jun 2025

Qwen3 Embedding 4B is a GPTQ-quantized text embedding model that generates dense vector representations for multilingual text, supporting retrieval, classification, clustering, and ranking.

est. price

~$0.008

/ 1M tokens · estimated, set at launch

API providers

downloads / mo

549.1K

license

apache-2.0

specs

Task	Text Embedding
Architecture	Qwen3 Transformer
Parameters	4B (W4A16 quantized)
License	Apache 2.0

about this model

boboliu/Qwen3-Embedding-4B-W4A16-G128 is a text embedding model, a GPTQ 4-bit quantized variant of the Qwen3-Embedding-4B (Apache 2.0), designed to generate dense vector representations for retrieval, classification, clustering, reranking, and bitext mining. It supports a 32,768-token context length, flexible embedding dimensions via Matryoshka Representation Learning (MRL), and user-defined instructions for task-specific optimization—using custom instructions typically yields 1–5% improvement.

The quantized model reduces GPU VRAM usage from 17,430 MB to 11,000 MB (without Flash Attention 2) while incurring a mean performance loss of approximately 0.72% on the C-MTEB benchmark.

C-MTEB Evaluation (multilingual)

Model	Params	Mean	Class.	Clust.	Pair Class.	Rerank.	Retr.	STS
multilingual-e5-large-instruct	0.6B	58.08	69.80	48.23	64.52	57.45	63.65	45.81
bge-multilingual-gemma2	9B	67.64	75.31	59.30	86.67	68.28	73.73	55.19
gte-Qwen2-1.5B-instruct	1.5B	67.12	72.53	54.61	79.50	68.21	71.86	60.05
gte-Qwen2-7B-instruct	7.6B	71.62	75.77	66.06	81.16	69.24	75.70	65.20
ritrieve_zh_v1	0.3B	72.71	76.88	66.50	85.98	72.86	76.97	63.92
Qwen3-Embedding-4B	4B	72.27	75.46	77.89	83.34	66.05	77.03	61.26
This model	4B-W4A16	71.75	75.43	77.51	83.04	65.73	76.15	60.47

The parent Qwen3-Embedding family includes an 8B variant that achieved No. 1 on the MTEB multilingual leaderboard (score 70.58, June 2025). This quantized 4B model maintains strong performance across over 100 languages with a compact memory footprint suitable for cost-efficient deployment.

best for

·Multilingual text retrieval across 100+ languages
·Document and text classification
·Code retrieval and search
·Bitext mining and text clustering

FAQ

What is the input format for this model?

Accepts single strings or pairs of strings (for similarity) and returns a vector embedding of dimension 2560 (or any dimension using Matryoshka Representation Learning).

How does this quantized version compare to the original 4B model?

It uses W4A16 quantization (4-bit weights, 16-bit activations) reducing VRAM from ~17.4GB to ~11GB with only ~0.72% performance loss on C-MTEB.

What is the performance on C-MTEB?

This model achieves a mean score of 71.75 on C-MTEB (overall tasks), compared to 72.27 for the original 4B model.

Can I use custom instructions to improve performance?

Yes, the model supports instruction-aware embeddings. Using task-specific instructions (in English) can yield 1% to 5% improvement on benchmarks.

How do I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key, passing an input text and specifying the model as boboliu/Qwen3-Embedding-4B-W4A16-G128.

not yet live

We're benchmarking and onboarding Qwen3 Embedding 4B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related embeddings models

compare all →

nomic-embed-text-v1.5