
Qwen3 Embedding 4B

boboliu/Qwen3-Embedding-4B-W4A16-G128

published Jun 2025 · updated Jun 2025

Qwen3 Embedding 4B is a GPTQ-quantized text embedding model that generates dense vector representations for multilingual text, supporting retrieval, classification, clustering, and ranking.

est. price: ~$0.008 / 1M tokens (estimated, set at launch)
API providers: 0
downloads / mo: 549.1K
license: apache-2.0

specs

Task: Text Embedding
Architecture: Qwen3 Transformer
Parameters: 4B (W4A16 quantized)
License: Apache 2.0

about this model

boboliu/Qwen3-Embedding-4B-W4A16-G128 is a GPTQ 4-bit quantized variant of Qwen3-Embedding-4B (Apache 2.0). It generates dense vector representations for retrieval, classification, clustering, reranking, and bitext mining. It supports a 32,768-token context length, flexible embedding dimensions via Matryoshka Representation Learning (MRL), and user-defined instructions for task-specific optimization. Using custom instructions typically yields a 1–5% improvement.

The quantized model reduces GPU VRAM usage from 17,430 MB to 11,000 MB (without Flash Attention 2), a saving of about 37%. The cost is a relative drop of approximately 0.72% in the mean C-MTEB score (72.27 to 71.75).
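The percentages above can be reproduced from the figures quoted on this page. The following is a minimal sketch of that arithmetic; the constant and function names are illustrative, and the mean scores are the ones listed in the C-MTEB table for the original 4B model and this model.

```python
# Figures quoted on this page (VRAM in MB, mean C-MTEB scores).
VRAM_FULL_MB = 17_430
VRAM_QUANT_MB = 11_000
MEAN_FULL = 72.27
MEAN_QUANT = 71.75


def relative_drop(before: float, after: float) -> float:
    """Return the drop from `before` to `after` as a percentage of `before`."""
    return (before - after) / before * 100.0


vram_saving_pct = relative_drop(VRAM_FULL_MB, VRAM_QUANT_MB)
mean_loss_pct = relative_drop(MEAN_FULL, MEAN_QUANT)

print(f"VRAM saving: {vram_saving_pct:.1f}%")
print(f"Mean C-MTEB loss: {mean_loss_pct:.2f}%")
```

Note that the 0.72% figure is relative to the original score; in absolute terms the mean drops by 0.52 points.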

C-MTEB Evaluation (Chinese)

Model                           Params    Mean   Class.  Clust.  Pair Class.  Rerank.  Retr.  STS
multilingual-e5-large-instruct  0.6B      58.08  69.80   48.23   64.52        57.45    63.65  45.81
bge-multilingual-gemma2         9B        67.64  75.31   59.30   86.67        68.28    73.73  55.19
gte-Qwen2-1.5B-instruct         1.5B      67.12  72.53   54.61   79.50        68.21    71.86  60.05
gte-Qwen2-7B-instruct           7.6B      71.62  75.77   66.06   81.16        69.24    75.70  65.20
ritrieve_zh_v1                  0.3B      72.71  76.88   66.50   85.98        72.86    76.97  63.92
Qwen3-Embedding-4B              4B        72.27  75.46   77.89   83.34        66.05    77.03  61.26
This model                      4B-W4A16  71.75  75.43   77.51   83.04        65.73    76.15  60.47

The parent Qwen3-Embedding family includes an 8B variant that achieved No. 1 on the MTEB multilingual leaderboard (score 70.58, June 2025). This quantized 4B model maintains strong performance across over 100 languages with a compact memory footprint suitable for cost-efficient deployment.

FAQ

What is the input format for this model?

The model accepts single strings or pairs of strings (for similarity) and returns a vector embedding of dimension 2560, or a smaller user-chosen dimension via Matryoshka Representation Learning (MRL).
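With MRL-trained embeddings, a shorter vector is commonly obtained by keeping the leading components and rescaling to unit length. The sketch below shows that client-side step under stated assumptions: the helper names are hypothetical, the 4-element vector stands in for a real 2560-dimensional output, and whether the hosted API will accept a dimension parameter directly is not stated on this page.

```python
import math


def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-in for a full embedding; real outputs have 2560 dimensions.
full = [3.0, 4.0, 1.0, 2.0]
short = truncate_embedding(full, 2)
print(short)
```

Shorter vectors reduce storage and search cost at some loss of accuracy, so the chosen dimension is worth validating on your own retrieval data.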

How does this quantized version compare to the original 4B model?

It uses W4A16 quantization (4-bit weights, 16-bit activations), which reduces VRAM from ~17.4 GB to ~11 GB with only a ~0.72% relative performance loss on C-MTEB.

What is the performance on C-MTEB?

This model achieves a mean score of 71.75 on C-MTEB (overall tasks), compared to 72.27 for the original 4B model.

Can I use custom instructions to improve performance?

Yes, the model supports instruction-aware embeddings. Using task-specific instructions (in English) can yield 1% to 5% improvement on benchmarks.
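As an illustration, an instruction is typically prepended to the query text before embedding, while documents are embedded as-is. The sketch below assumes the prompt template published in the upstream Qwen3-Embedding model card ("Instruct: ... Query:..."); this page does not confirm the template, so verify it against the model card before relying on it. The function name and the example task text are illustrative.

```python
def format_query(task: str, query: str) -> str:
    """Prefix a query with a one-sentence task instruction.

    Assumption: template taken from the upstream Qwen3-Embedding model card,
    not from this page.
    """
    return f"Instruct: {task}\nQuery:{query}"


# Example task description (illustrative); instructions are written in English.
task = "Given a web search query, retrieve relevant passages that answer the query"
queries = [format_query(task, q) for q in ["What is W4A16 quantization?"]]

# Documents are embedded without an instruction prefix.
documents = ["W4A16 stores weights in 4 bits and keeps activations in 16 bits."]

print(queries[0])
```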

How do I call this model via the gigarouter API?

Once the model is live, use the OpenAI-compatible embeddings endpoint with your API key, passing your input text and setting the model to boboliu/Qwen3-Embedding-4B-W4A16-G128.
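A minimal sketch of such a request, using only the Python standard library, might look like the following. The base URL is a placeholder (the real gigarouter endpoint is not given on this page, and the model is not yet live), and the helper names are hypothetical. The request and response shapes follow the OpenAI embeddings format: a JSON body with `model` and `input`, and a response whose `data` list holds one `embedding` per input.

```python
import json
import urllib.request

BASE_URL = "https://api.gigarouter.example/v1"  # placeholder, not the real endpoint
MODEL_ID = "boboliu/Qwen3-Embedding-4B-W4A16-G128"


def build_embedding_request(texts: list[str], api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request in the OpenAI embeddings format."""
    body = json.dumps({"model": MODEL_ID, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/embeddings",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_embeddings(response_body: bytes) -> list[list[float]]:
    """Extract the embedding vectors from a response, ordered by input index."""
    payload = json.loads(response_body)
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


request = build_embedding_request(["hello world"], api_key="YOUR_API_KEY")
# To send it once the endpoint exists:
#   with urllib.request.urlopen(request) as response:
#       vectors = parse_embeddings(response.read())
```

Any OpenAI-compatible client library should work the same way once pointed at the provider's base URL.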

not yet live

We're benchmarking and onboarding Qwen3 Embedding 4B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
