skip to content
gigarouter gigarouter
models / vision-language · coming soon

Qwen2 VL 2B Instruct

Qwen/Qwen2-VL-2B-Instruct

published Aug 2024 · updated Jan 2025

Qwen2 VL 2B Instruct is a vision-language model that understands images, videos, and multilingual text, supporting dynamic resolution and advanced reasoning.

est. price
~$0.626
/ 1k images · estimated, set at launch
API providers
0
downloads / mo
3.6M
license
apache-2.0

specs

TaskVision-Language Model (Visual Understanding, Video QA, Document QA)
ArchitectureTransformer with Multimodal Rotary Position Embedding (M-RoPE) and Naive Dynamic Resolution
Parameters2 billion

about this model

Qwen2-VL-2B-Instruct is a vision-language model that processes images, videos, and text to perform visual understanding, question answering, document analysis, and reasoning tasks. The model introduces Naive Dynamic Resolution, allowing it to handle arbitrary image resolutions by mapping them into a variable number of visual tokens. It also employs Multimodal Rotary Position Embedding (M-RoPE), which separately captures positional information for 1D text, 2D images, and 3D video, improving multimodal processing. Diagram depicting dynamic resolution processing Illustration of Multimodal Rotary Position Embedding (M-RoPE) Key capabilities include understanding videos longer than 20 minutes for question answering and content creation, supporting text extraction in multiple languages (European languages, Japanese, Korean, Arabic, Vietnamese) within images, and operating as an agent for mobile and robotic devices via visual environment interpretation and text instructions.

Performance Benchmarks

The model achieves strong results across visual understanding benchmarks. Selected image and video benchmark scores are shown below.
BenchmarkScore
MMMU41.1
DocVQA90.1
InfoVQA65.5
ChartQA73.5
TextVQA79.7
OCRBench794
RealWorldQA62.9
MMVet49.5
MathVista43.0
MVBench63.2
Video-MME (wo/ subs)55.6
Video-MME (w/ subs)60.4
Qwen2-VL-2B-Instruct outperforms comparable models (e.g., InternVL2-2B, MiniCPM-V 2.0) on most reported benchmarks. This model is hosted by gigarouter as a managed, OpenAI-compatible API.

best for

FAQ

What is the parameter size of Qwen2 VL 2B Instruct?

It has 2 billion parameters.

Can it process videos? What is the maximum length?

Yes, it supports videos over 20 minutes for QA and dialog.

What input formats are supported?

Images (URL, base64, file), video (frames or file), and text.

How do I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key; refer to gigarouter documentation for details.

Does it support multilingual text in images?

Yes, it supports text in most European languages, Japanese, Korean, Arabic, Vietnamese, and more.

not yet live

We're benchmarking and onboarding Qwen2 VL 2B Instruct as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →