Qwen2.5-VL 7B Instruct AWQ

Qwen/Qwen2.5-VL-7B-Instruct-AWQ

published Feb 2025 · updated Apr 2025

Qwen2.5-VL 7B Instruct AWQ is a vision-language model that understands images, videos, and text, supports agentic tool use, and provides structured outputs and visual localization.

est. price

~$1.341

/ 1k images · estimated, set at launch

API providers

downloads / mo

1.7M

license

apache-2.0

specs

Task	Vision-Language (VLM)
Architecture	Vision Transformer (ViT) with SwiGLU and RMSNorm, plus Qwen2.5 LLM
Parameters	7B
License	Not specified in card

about this model

Qwen2.5-VL-7B-Instruct-AWQ is a vision-language model (VLM) that processes images, videos, and text to perform visual understanding, localization, and structured output generation. It is optimized for efficient inference through AWQ quantization while maintaining high performance across diverse multimodal benchmarks.

Key Capabilities

Dynamic Resolution & Video Understanding: The model handles images at native resolution and supports variable frame rate sampling for videos, enabling comprehension of content over one hour long and pinpointing specific events.
Visual Agent & Localization: It can act as a visual agent for computer and phone use, generate bounding boxes or points for object localization, and produce stable JSON outputs for coordinates and attributes.
Structured Outputs: Scanned invoices, forms, tables, and similar documents are processed into structured data, benefiting finance, commerce, and document analysis workflows.
Extended Context: The model supports YaRN for context extension beyond 32,768 tokens, allowing handling of longer sequences with a factor of 4.

Diagram illustrating dynamic FPS sampling and time-aware mRoPE for video understanding.

Benchmark Performance (AWQ Quantized)

Evaluated using VLMEvalKit, the 7B AWQ model achieves the following scores:

Benchmark	Score
MMMU_VAL (Accuracy)	55.6
DocVQA_VAL (Accuracy)	94.6
MMBench_DEV_EN (Accuracy)	84.2
MathVista_MINI (Accuracy)	64.7

For comparison, the BF16 version of the same model achieves 58.4, 94.9, 84.1, and 67.9 respectively. The 7B model also outperforms GPT-4o-mini on several college-level, math, document understanding, and video understanding tasks (source: official blog).

Architecture

The vision encoder uses window attention, SwiGLU, and RMSNorm, aligned with the Qwen2.5 LLM. M-RoPE with absolute time alignment enables temporal sequence learning, and the model supports dynamic resolution for both spatial and temporal dimensions.

best for

·Visual question answering and image description
·Document and invoice data extraction with structured JSON output
·Long video understanding and event localization

FAQ

What input formats does Qwen2.5-VL 7B Instruct AWQ support?

It accepts images (local files, URLs, base64), videos (local files), and text in a chat template.

How can I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key, sending a chat completion request with image and text inputs.

Does this model support context extension beyond 32K tokens?

Yes, it supports YaRN for extending context up to 4x the original 32,768 tokens, though this may affect spatial/temporal localization tasks.

What is the AWQ quantization effect on performance?

AWQ reduces model size and speeds inference with minimal accuracy loss; e.g., on MMMU_VAL it scores 55.6 vs 58.4 for BF16.

not yet live

We're benchmarking and onboarding Qwen2.5-VL 7B Instruct AWQ as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →

Qwen2.5-VL-7B-Instruct

9.8M dl/mo

Qwen3.6-35B-A3B-FP8

6.2M dl/mo

Qwen2.5-VL-3B-Instruct

5.3M dl/mo

gemma-4-26B-A4B-it-AWQ-4bit