skip to content
gigarouter gigarouter
models / vision-language · coming soon

Qwen2-VL 7B Instruct

Qwen/Qwen2-VL-7B-Instruct

published Aug 2024 · updated Feb 2025

Qwen2-VL 7B Instruct is a vision-language model that understands images, videos, and text with state-of-the-art performance on visual understanding benchmarks.

est. price
~$1.341
/ 1k images · estimated, set at launch
API providers
0
downloads / mo
1.8M
license
apache-2.0

specs

TaskMultimodal Understanding (Image, Video, Text)
ArchitectureVision-Language Model with Naive Dynamic Resolution and Multimodal Rotary Position Embedding (M-ROPE)
Parameters7 billion
LicenseApache 2.0

about this model

Qwen2-VL-7B-Instruct is a vision-language model (VLM) that processes images and videos at arbitrary resolutions, supporting text-based question answering, document parsing, video understanding, and visual agent tasks.

Capabilities and Architecture

The model introduces Naive Dynamic Resolution, mapping images of varying sizes into a dynamic number of visual tokens for efficient representation. Multimodal Rotary Position Embedding (M-RoPE) captures 1D textual, 2D visual, and 3D video positional information, enabling coherent multimodal processing. Qwen2-VL-7B-Instruct can understand videos over 20 minutes and supports multilingual text within images, including European languages, Japanese, Korean, Arabic, and Vietnamese.

Diagram illustrating the Naive Dynamic Resolution mechanism of Qwen2-VL, showing how images of different sizes are split into rectangular patches feeding into the vision encoder. Diagram of Multimodal Rotary Position Embedding (M-RoPE) showing decomposed positional embeddings for text, 2D images, and 3D video.

Benchmark Performance

On image understanding benchmarks, Qwen2-VL-7B-Instruct achieves strong results:

BenchmarkScoreComparison
DocVQA94.5Outperforms InternVL2-8B (91.6) and MiniCPM-V 2.6 (90.8)
MTVQA26.3No comparable scores listed for other models
RealWorldQA70.1Outperforms InternVL2-8B (64.4)
TextVQA84.3Outperforms InternVL2-8B (77.4) and MiniCPM-V 2.6 (80.1)
MMMU54.1Below GPT-4o-mini (60) but competitive
MMVet62.0Below GPT-4o-mini (66.9) but above InternVL2-8B (54.2)

On video benchmarks, the model leads in its size class:

BenchmarkScoreComparison
Video-MME63.3 / 69.0Outperforms InternVL2-8B (54.0/56.9) and MiniCPM-V 2.6 (60.9/63.6)
MVBench67.0Outperforms InternVL2-8B (66.4) and LLaVA-OneVision-7B (56.7)

Additional benchmarks include MMBench-EN (83.0), MMBench-V1.1 (80.7), and HallBench (50.6). The model is released under the Apache 2.0 license.

best for

FAQ

What license is Qwen2-VL 7B Instruct released under?

It is released under the Apache 2.0 license.

What input formats does the model support?

It supports images, videos, and text. It can handle arbitrary image resolutions via Naive Dynamic Resolution.

How does Qwen2-VL 7B Instruct compare to GPT-4o-mini on visual benchmarks?

It outperforms GPT-4o-mini on many benchmarks, including MMMU, DocVQA, TextVQA, and RealWorldQA.

Can I use the model for multilingual text recognition?

Yes, it supports text understanding in multiple languages including English, Chinese, most European languages, Japanese, Korean, Arabic, and Vietnamese.

How do I call Qwen2-VL 7B Instruct via API on gigarouter?

Use the OpenAI-compatible endpoint with your API key. The model accepts text, image URLs, and video inputs in the standard chat format.

not yet live

We're benchmarking and onboarding Qwen2-VL 7B Instruct as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →