Qwen3 VL Reranker 2B
Qwen/Qwen3-VL-Reranker-2B
published Jan 2026 · updated Apr 2026
Qwen3 VL Reranker 2B is a cross-encoder reranker that accepts query-document pairs comprising text, images, screenshots, video, or mixed modalities and outputs a precise relevance score.
specs
| Spec | Value |
|---|---|
| Task | Multimodal Reranking |
| Architecture | Cross-encoder with cross-attention |
| Parameters | 2B |
| License | Apache 2.0 |
| Context Length | 32K tokens |
about this model
Qwen3-VL-Reranker-2B is a multimodal reranking model that refines retrieval results by computing precise relevance scores for query-document pairs, supporting text, images, screenshots, video, and mixed-modality inputs.
Built on the Qwen3-VL foundation, the model uses a cross-encoder architecture with cross-attention mechanisms to estimate fine-grained relevance. It supports over 30 languages and handles inputs up to 32K tokens. The model is instruction-aware: supplying a task-specific instruction typically improves performance by 1% to 5% compared with using no instruction. It is commonly paired with an embedding model in a two-stage retrieval pipeline, where the embedding model handles initial recall and the reranker rescores the recalled candidates, but it can also be used standalone.
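To illustrate what "instruction-aware" means in practice, here is a minimal sketch that prepends a task instruction to a text query-document pair. The `<Instruct>`/`<Query>`/`<Document>` layout is an assumption borrowed from the text-only Qwen3-Reranker convention; the exact template Qwen3-VL-Reranker expects may differ. `format_pair` and `DEFAULT_INSTRUCTION` are hypothetical names.

```python
from typing import Optional

# HYPOTHETICAL placeholder instruction; this page does not specify a default.
DEFAULT_INSTRUCTION = "Given a web search query, retrieve relevant passages that answer the query"

def format_pair(query: str, document: str, instruction: Optional[str] = None) -> str:
    """Build one text prompt for a query-document pair.

    ASSUMPTION: the tag layout mirrors the text-only Qwen3-Reranker template;
    check the official model documentation before relying on it.
    """
    task = instruction if instruction else DEFAULT_INSTRUCTION
    return f"<Instruct>: {task}\n<Query>: {query}\n<Document>: {document}"

# A task-specific instruction replaces the generic default.
prompt = format_pair(
    "red sneakers on a white background",
    "Product caption: crimson running shoes, studio shot",
    instruction="Given a shopping query, judge whether the product caption matches it",
)
print(prompt)
```

Swapping the instruction is the only change between runs, which makes it easy to measure the instruction's effect on a held-out set for your own task.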
| Benchmark | Category | Score |
|---|---|---|
| MMEB-v2 (Retrieval) – Avg | Average | 75.1 |
| MMEB-v2 (Retrieval) – Image | Image retrieval | 73.8 |
| MMEB-v2 (Retrieval) – Video | Video retrieval | 52.1 |
| MMEB-v2 (Retrieval) – VisDoc | Visual document retrieval | 83.4 |
| MMTEB (Retrieval) | Text retrieval | 70.0 |
| JinaVDR | Visual document retrieval | 80.9 |
| ViDoRe v3 | Visual document retrieval | 60.8 |
Across all evaluated benchmarks, Qwen3-VL-Reranker-2B consistently outperforms the base embedding model and baseline rerankers of comparable size. The larger 8B variant achieves the best overall results (79.2 on MMEB-v2 Avg, 86.3 on VisDoc).

best for
- Image-text relevance scoring
- Video-text matching and retrieval
- Multimodal document ranking (text + images + video)
FAQ
Qwen3 VL Reranker 2B is best suited to reranking initial retrieval results: it scores the relevance between a query (text, image, or video) and each candidate document (text, image, or video), significantly improving retrieval accuracy in multimodal pipelines.
The Embedding model generates vector embeddings for efficient first-stage recall, while the Reranker performs fine-grained cross-encoder scoring on the retrieved candidates. Used together, they form a high-accuracy two-stage retrieval pipeline.
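The two-stage flow can be sketched as follows. This is a toy under stated assumptions: the 2-D vectors stand in for embedding-model output, and `overlap_score` (a Jaccard word-overlap function) is a hypothetical placeholder for the reranker's cross-encoder score. The point is the control flow: cheap similarity over the whole corpus, then expensive pairwise scoring on the recalled candidates only.

```python
import math
from typing import Callable, List, Sequence, Tuple

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def overlap_score(query: str, doc: str) -> float:
    """PLACEHOLDER for the reranker: Jaccard overlap of lowercase word sets."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

def two_stage_search(
    query: str,
    query_vec: Sequence[float],
    docs: List[str],
    doc_vecs: List[Sequence[float]],
    rerank: Callable[[str, str], float],
    recall_k: int = 3,
    top_n: int = 2,
) -> List[Tuple[int, float]]:
    # Stage 1: cheap recall, ranking the whole corpus by embedding similarity.
    order = sorted(range(len(docs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
    candidates = order[:recall_k]
    # Stage 2: pairwise scoring of the recalled candidates only.
    scored = [(i, rerank(query, docs[i])) for i in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

docs = ["cat on a sofa", "dog in a park", "a cat sleeping on a sofa", "stock market news"]
doc_vecs = [[0.9, 0.1], [0.2, 0.8], [0.8, 0.3], [0.0, 1.0]]  # toy 2-D "embeddings"
result = two_stage_search("cat sleeping on sofa", [1.0, 0.0], docs, doc_vecs, overlap_score)
print(result)  # [(2, 0.8), (0, 0.6)]
```

Stage 1 ranks document 0 first, but the pairwise stage promotes document 2, which matches the query more completely. In a real pipeline, `rerank` would call the reranker, and `recall_k` bounds how many expensive cross-encoder passes you pay for.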
The model accepts pairs of queries and documents, where each can be plain text, an image URL, a video, or a mix of text and image. It supports over 30 languages and a context length of 32K tokens.
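One way to represent a single- or mixed-modality query or document is as a list of typed content parts. This is a minimal sketch, assuming an OpenAI-style message format: the `text` and `image_url` part shapes follow the OpenAI chat content-part convention, while the `video_url` shape is an assumption (some OpenAI-compatible servers accept it). `to_content_parts` is a hypothetical helper, and the URL is a placeholder.

```python
from typing import Any, Dict, List, Optional

def to_content_parts(
    text: Optional[str] = None,
    image_url: Optional[str] = None,
    video_url: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Describe one query or document as a list of modality-tagged parts.

    "text" and "image_url" follow the OpenAI chat content-part format;
    the "video_url" shape is an ASSUMPTION, not confirmed for this model's API.
    """
    parts: List[Dict[str, Any]] = []
    if text:
        parts.append({"type": "text", "text": text})
    if image_url:
        parts.append({"type": "image_url", "image_url": {"url": image_url}})
    if video_url:
        parts.append({"type": "video_url", "video_url": {"url": video_url}})
    if not parts:
        raise ValueError("need at least one of text, image_url, video_url")
    return parts

# A text-only query paired with a mixed text + image document.
query = to_content_parts(text="bar chart of quarterly revenue")
document = to_content_parts(text="Q3 earnings slide", image_url="https://example.com/slide.png")
print([p["type"] for p in document])  # ['text', 'image_url']
```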
Send a request to the gigarouter OpenAI-compatible endpoint with your API key, passing the query and the candidate documents as inputs; the response includes a relevance score for each document.
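Once the hosted endpoint is live, a call might look like the sketch below. Treat every field name and the URL as assumptions: the payload (`model`, `query`, `documents`, `top_n`) and response (`results` with `index` and `relevance_score`) follow the rerank schema used by several providers, but this page does not document gigarouter's actual path or schema. The request is built but not sent, and the response is parsed from a hand-written sample.

```python
import json
import urllib.request
from typing import List, Tuple

# HYPOTHETICAL placeholders: the real endpoint URL and schema are not documented here.
RERANK_URL = "https://api.example.invalid/v1/rerank"
MODEL_ID = "Qwen/Qwen3-VL-Reranker-2B"

def build_rerank_request(api_key: str, query: str, documents: List[str], top_n: int) -> urllib.request.Request:
    """Prepare (but do not send) a POST request carrying an assumed rerank payload."""
    payload = {"model": MODEL_ID, "query": query, "documents": documents, "top_n": top_n}
    return urllib.request.Request(
        RERANK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )

def parse_scores(body: str) -> List[Tuple[int, float]]:
    """Return (document index, relevance score) pairs, highest score first.

    ASSUMPTION: the body looks like {"results": [{"index": i, "relevance_score": s}, ...]}.
    """
    results = json.loads(body)["results"]
    pairs = [(r["index"], float(r["relevance_score"])) for r in results]
    return sorted(pairs, key=lambda p: p[1], reverse=True)

req = build_rerank_request("YOUR_API_KEY", "a diagram of a transformer", ["doc A", "doc B"], top_n=2)
# Against a real endpoint you would send it with: urllib.request.urlopen(req)
sample = '{"results": [{"index": 1, "relevance_score": 0.12}, {"index": 0, "relevance_score": 0.91}]}'
print(parse_scores(sample))  # [(0, 0.91), (1, 0.12)]
```

Sorting on the client side keeps the code correct even if a server returns results in input order rather than by score.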
The model is released under the Apache 2.0 license, which allows commercial use and modification.
We're benchmarking and onboarding Qwen3 VL Reranker 2B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.