Qwen2-VL 7B Instruct

Qwen/Qwen2-VL-7B-Instruct

published Aug 2024 · updated Feb 2025

Qwen2-VL 7B Instruct is a vision-language model that understands images, videos, and text with state-of-the-art performance on visual understanding benchmarks.

est. price

~$1.341

/ 1k images · estimated, set at launch

API providers

downloads / mo

1.8M

license

apache-2.0

specs

Task	Multimodal Understanding (Image, Video, Text)
Architecture	Vision-Language Model with Naive Dynamic Resolution and Multimodal Rotary Position Embedding (M-ROPE)
Parameters	7 billion
License	Apache 2.0

about this model

Qwen2-VL-7B-Instruct is a vision-language model (VLM) that processes images and videos at arbitrary resolutions, supporting text-based question answering, document parsing, video understanding, and visual agent tasks.

Capabilities and Architecture

The model introduces Naive Dynamic Resolution, mapping images of varying sizes into a dynamic number of visual tokens for efficient representation. Multimodal Rotary Position Embedding (M-RoPE) captures 1D textual, 2D visual, and 3D video positional information, enabling coherent multimodal processing. Qwen2-VL-7B-Instruct can understand videos over 20 minutes and supports multilingual text within images, including European languages, Japanese, Korean, Arabic, and Vietnamese.

Diagram illustrating the Naive Dynamic Resolution mechanism of Qwen2-VL, showing how images of different sizes are split into rectangular patches feeding into the vision encoder.

Diagram of Multimodal Rotary Position Embedding (M-RoPE) showing decomposed positional embeddings for text, 2D images, and 3D video.

Benchmark Performance

On image understanding benchmarks, Qwen2-VL-7B-Instruct achieves strong results:

Benchmark	Score	Comparison
DocVQA	94.5	Outperforms InternVL2-8B (91.6) and MiniCPM-V 2.6 (90.8)
MTVQA	26.3	No comparable scores listed for other models
RealWorldQA	70.1	Outperforms InternVL2-8B (64.4)
TextVQA	84.3	Outperforms InternVL2-8B (77.4) and MiniCPM-V 2.6 (80.1)
MMMU	54.1	Below GPT-4o-mini (60) but competitive
MMVet	62.0	Below GPT-4o-mini (66.9) but above InternVL2-8B (54.2)

On video benchmarks, the model leads in its size class:

Benchmark	Score	Comparison
Video-MME	63.3 / 69.0	Outperforms InternVL2-8B (54.0/56.9) and MiniCPM-V 2.6 (60.9/63.6)
MVBench	67.0	Outperforms InternVL2-8B (66.4) and LLaVA-OneVision-7B (56.7)

Additional benchmarks include MMBench-EN (83.0), MMBench-V1.1 (80.7), and HallBench (50.6). The model is released under the Apache 2.0 license.

best for

·Document and chart understanding (e.g., DocVQA, ChartQA)
·Long-form video question answering (20+ minutes)
·Visual agent for operating mobile phones and robots

FAQ

What license is Qwen2-VL 7B Instruct released under?

It is released under the Apache 2.0 license.

What input formats does the model support?

It supports images, videos, and text. It can handle arbitrary image resolutions via Naive Dynamic Resolution.

How does Qwen2-VL 7B Instruct compare to GPT-4o-mini on visual benchmarks?

It outperforms GPT-4o-mini on many benchmarks, including MMMU, DocVQA, TextVQA, and RealWorldQA.

Can I use the model for multilingual text recognition?

Yes, it supports text understanding in multiple languages including English, Chinese, most European languages, Japanese, Korean, Arabic, and Vietnamese.

How do I call Qwen2-VL 7B Instruct via API on gigarouter?

Use the OpenAI-compatible endpoint with your API key. The model accepts text, image URLs, and video inputs in the standard chat format.

not yet live

We're benchmarking and onboarding Qwen2-VL 7B Instruct as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →

Qwen2.5-VL-7B-Instruct

9.8M dl/mo

Qwen3.6-35B-A3B-FP8

6.2M dl/mo

Qwen2.5-VL-3B-Instruct

5.3M dl/mo

gemma-4-26B-A4B-it-AWQ-4bit