Qwen2 VL 2B Instruct

Qwen/Qwen2-VL-2B-Instruct

published Aug 2024 · updated Jan 2025

Qwen2 VL 2B Instruct is a vision-language model that understands images, videos, and multilingual text, supporting dynamic resolution and advanced reasoning.

est. price

~$0.626

/ 1k images · estimated, set at launch

API providers

downloads / mo

3.6M

license

apache-2.0

specs

Task	Vision-Language Model (Visual Understanding, Video QA, Document QA)
Architecture	Transformer with Multimodal Rotary Position Embedding (M-RoPE) and Naive Dynamic Resolution
Parameters	2 billion

about this model

Qwen2-VL-2B-Instruct is a vision-language model that processes images, videos, and text to perform visual understanding, question answering, document analysis, and reasoning tasks. The model introduces Naive Dynamic Resolution, allowing it to handle arbitrary image resolutions by mapping them into a variable number of visual tokens. It also employs Multimodal Rotary Position Embedding (M-RoPE), which separately captures positional information for 1D text, 2D images, and 3D video, improving multimodal processing. Diagram depicting dynamic resolution processing

Diagram depicting dynamic resolution processing

Illustration of Multimodal Rotary Position Embedding (M-RoPE)

Key capabilities include understanding videos longer than 20 minutes for question answering and content creation, supporting text extraction in multiple languages (European languages, Japanese, Korean, Arabic, Vietnamese) within images, and operating as an agent for mobile and robotic devices via visual environment interpretation and text instructions.

Performance Benchmarks

The model achieves strong results across visual understanding benchmarks. Selected image and video benchmark scores are shown below.

Benchmark	Score
MMMU	41.1
DocVQA	90.1
InfoVQA	65.5
ChartQA	73.5
TextVQA	79.7
OCRBench	794
RealWorldQA	62.9
MMVet	49.5
MathVista	43.0
MVBench	63.2
Video-MME (wo/ subs)	55.6
Video-MME (w/ subs)	60.4

Qwen2-VL-2B-Instruct outperforms comparable models (e.g., InternVL2-2B, MiniCPM-V 2.0) on most reported benchmarks. This model is hosted by gigarouter as a managed, OpenAI-compatible API.

best for

·Image description and question answering across varied resolutions
·Long-form video understanding (20+ minutes) and QA
·Document and chart understanding (DocVQA, ChartQA)

FAQ

What is the parameter size of Qwen2 VL 2B Instruct?

It has 2 billion parameters.

Can it process videos? What is the maximum length?

Yes, it supports videos over 20 minutes for QA and dialog.

What input formats are supported?

Images (URL, base64, file), video (frames or file), and text.

How do I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key; refer to gigarouter documentation for details.

Does it support multilingual text in images?

Yes, it supports text in most European languages, Japanese, Korean, Arabic, Vietnamese, and more.

not yet live

We're benchmarking and onboarding Qwen2 VL 2B Instruct as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →

Qwen2.5-VL-7B-Instruct

9.8M dl/mo

Qwen3.6-35B-A3B-FP8

6.2M dl/mo

Qwen2.5-VL-3B-Instruct

5.3M dl/mo

gemma-4-26B-A4B-it-AWQ-4bit