Qwen2.5 VL 3B

Qwen/Qwen2.5-VL-3B-Instruct

published Jan 2025 · updated Apr 2025

Qwen2.5 VL 3B is a vision-language model that processes images and videos to generate text, supporting visual reasoning, agent tasks, document analysis, and long-video understanding.

est. price

~$0.626

/ 1k images · estimated, set at launch

API providers

downloads / mo

5.3M

specs

Task	image-text-to-text
Architecture	Qwen2.5 VL (ViT with window attention, SwiGLU, RMSNorm, M-RoPE)
Parameters	3 billion
License	Qwen Research License

about this model

Qwen2.5-VL-3B-Instruct is a vision-language model (VLM) that processes images, videos, and text to perform visual understanding, document analysis, visual agent tasks, and structured output generation. It is the 3-billion-parameter instruction-tuned variant in the Qwen2.5-VL series.

Key Capabilities

Visual understanding: Recognizes common objects, texts, charts, icons, graphics, and layouts within images.
Agentic functionality: Acts as a visual agent capable of reasoning and dynamically directing tools, including computer and phone use.
Long video comprehension: Understands videos over one hour and can pinpoint relevant video segments for specific events.
Visual localization: Generates bounding boxes or points for objects and provides stable JSON outputs for coordinates and attributes.
Structured outputs: Extracts structured data from invoices, forms, tables, and similar documents.

Architecture

Employs a dynamic resolution mechanism for images and dynamic FPS sampling for videos, with an updated vision encoder using window attention, SwiGLU, and RMSNorm. Multimodal Rotary Position Embedding (M-RoPE) enables fusion of positional information across text, images, and videos.

Benchmark Performance

Image benchmarks:

Benchmark	Qwen2.5-VL-3B	Qwen2-VL-7B	InternVL2.5-4B
MMMU	53.1	54.1	52.3
DocVQA	93.9	94.5	91.6
InfoVQA	77.1	76.5	72.1
MathVista	62.3	58.2	60.5
MathVision	21.2	16.3	20.9

Video benchmarks: MLVU: 68.2, VideoMME: 67.6/61.5, MVBench: 67.0, EgoSchema: 64.8, PerceptionTest: 66.9.

Agent benchmarks: ScreenSpot: 55.5, AndroidWorld_SR: 90.8, AITZ_EM: 76.9.

Input/Output

Accepts interleaved images, videos, and text. Outputs text, bounding boxes, points, and structured JSON. Supports dynamic token allocation per image (4–16,384 visual tokens).

best for

·Analyzing documents, charts, and invoices
·Visual agent for computer and mobile GUI interaction
·Long video comprehension and event pinpointing
·Structured data extraction from scanned forms

FAQ

What is Qwen2.5 VL 3B best used for?

It excels at visual understanding tasks such as OCR, chart analysis, visual agent interaction, long video comprehension, and structured output extraction.

How does it compare to the 7B and 72B versions?

The 3B model is smaller and faster, offering strong performance on benchmarks like DocVQA and MathVista while being more efficient for deployment.

What license is it released under?

It uses the Qwen Research License, which is specific to the Qwen model family.

What input formats does it support?

It accepts images (URLs or base64), videos (paths or URLs), and text interleaved with visual content.

How can I call this model via API?

Access it through the gigarouter OpenAI-compatible endpoint using your API key. Send requests with image/video and text inputs.

not yet live

We're benchmarking and onboarding Qwen2.5 VL 3B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →

Qwen2.5-VL-7B-Instruct

9.8M dl/mo

Qwen3.6-35B-A3B-FP8

6.2M dl/mo

gemma-4-26B-A4B-it-AWQ-4bit