skip to content
gigarouter gigarouter
models / vision-language · coming soon

Qwen2-VL 7B Instruct

Qwen/Qwen2-VL-7B-Instruct-AWQ

published Aug 2024 · updated Sep 2024

Qwen2-VL 7B Instruct is a vision-language model that understands images, videos, and text, supporting dynamic resolution, multilingual text, and agent operations.

est. price
~$1.341
/ 1k images · estimated, set at launch
API providers
0
downloads / mo
1.8M
license
apache-2.0

specs

TaskImage-text-to-text (Vision-Language Understanding)
ArchitectureTransformer with Multimodal Rotary Position Embedding (M-ROPE) and Naive Dynamic Resolution
Parameters7 billion
LicenseApache 2.0

about this model

Qwen2-VL-7B-Instruct-AWQ is a vision-language model (VLM) that processes images, videos, and text for tasks such as visual question answering, document understanding, and multimodal reasoning. It is the instruction-tuned 7B-parameter variant of the Qwen2-VL series, quantized with AWQ for efficient deployment under an Apache 2.0 license. The model is associated with the paper Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution (arXiv:2409.12191) and is categorized as an image-text-to-text pipeline.

Key Capabilities

  • Dynamic resolution: The Naive Dynamic Resolution mechanism maps arbitrary image resolutions to a variable number of visual tokens, adapting to content complexity.
  • Multimodal positional encoding: Multimodal Rotary Position Embedding (M-RoPE) decomposes positional information into 1D textual, 2D visual, and 3D video components.
  • Extended video understanding: Capable of comprehending videos longer than 20 minutes for QA, dialog, and content creation.
  • Agentic operation: Can be integrated with mobile phones, robots, and other devices to perform actions based on visual input and text instructions.
  • Multilingual image text: Supports reading text in images across English, Chinese, most European languages, Japanese, Korean, Arabic, Vietnamese, and others.

Architecture

Diagram illustrating Naive Dynamic Resolution

Diagram illustrating Multimodal Rotary Position Embedding (M-RoPE)

Benchmark Performance (AWQ Quantization)

BenchmarkAccuracy
MMMU_VAL53.66
DocVQA_VAL93.10
MMBench_DEV_EN81.61
MathVista_MINI56.80

Inference Speed (AWQ, single A100 80GB, batch size 1)

Input LengthSpeed (tokens/s)GPU Memory (GB)
132.087.07
614432.6612.56
1433631.4220.07
3072019.9535.08

best for

FAQ

What is the model best for?

It excels at visual understanding, document QA, video comprehension, and agent tasks, achieving state-of-the-art on benchmarks like DocVQA, MTVQA, and MathVista.

What license is the model released under?

The model is released under the Apache 2.0 license.

What input formats does the model support?

It supports images (URLs, base64, file paths), videos (as frame lists or video files), and text, with dynamic resolution handling.

How can I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key, sending image and text inputs as specified in the gigarouter documentation.

How does Qwen2-VL 7B Instruct compare to other vision-language models?

It achieves competitive performance against leading models, with state-of-the-art results on DocVQA and MTVQA, and matches GPT-4o and Claude 3.5 on several benchmarks.

not yet live

We're benchmarking and onboarding Qwen2-VL 7B Instruct as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →