Qwen2.5 VL 3B
Qwen/Qwen2.5-VL-3B-Instruct
published Jan 2025 · updated Apr 2025
Qwen2.5 VL 3B is a vision-language model that processes images and videos to generate text, supporting visual reasoning, agent tasks, document analysis, and long-video understanding.
specs
| Task | image-text-to-text |
| Architecture | Qwen2.5 VL (ViT with window attention, SwiGLU, RMSNorm, M-RoPE) |
| Parameters | 3 billion |
| License | Qwen Research License |
about this model
Key Capabilities
- Visual understanding: Recognizes common objects, texts, charts, icons, graphics, and layouts within images.
- Agentic functionality: Acts as a visual agent capable of reasoning and dynamically directing tools, including computer and phone use.
- Long video comprehension: Understands videos over one hour and can pinpoint relevant video segments for specific events.
- Visual localization: Generates bounding boxes or points for objects and provides stable JSON outputs for coordinates and attributes.
- Structured outputs: Extracts structured data from invoices, forms, tables, and similar documents.
Architecture
Employs a dynamic resolution mechanism for images and dynamic FPS sampling for videos, with an updated vision encoder using window attention, SwiGLU, and RMSNorm. Multimodal Rotary Position Embedding (M-RoPE) enables fusion of positional information across text, images, and videos.
Benchmark Performance
Image benchmarks:
| Benchmark | Qwen2.5-VL-3B | Qwen2-VL-7B | InternVL2.5-4B |
|---|---|---|---|
| MMMU | 53.1 | 54.1 | 52.3 |
| DocVQA | 93.9 | 94.5 | 91.6 |
| InfoVQA | 77.1 | 76.5 | 72.1 |
| MathVista | 62.3 | 58.2 | 60.5 |
| MathVision | 21.2 | 16.3 | 20.9 |
Video benchmarks: MLVU: 68.2, VideoMME: 67.6/61.5, MVBench: 67.0, EgoSchema: 64.8, PerceptionTest: 66.9.
Agent benchmarks: ScreenSpot: 55.5, AndroidWorld_SR: 90.8, AITZ_EM: 76.9.
Input/Output
Accepts interleaved images, videos, and text. Outputs text, bounding boxes, points, and structured JSON. Supports dynamic token allocation per image (4–16,384 visual tokens).
best for
- ·Analyzing documents, charts, and invoices
- ·Visual agent for computer and mobile GUI interaction
- ·Long video comprehension and event pinpointing
- ·Structured data extraction from scanned forms
FAQ
It excels at visual understanding tasks such as OCR, chart analysis, visual agent interaction, long video comprehension, and structured output extraction.
The 3B model is smaller and faster, offering strong performance on benchmarks like DocVQA and MathVista while being more efficient for deployment.
It uses the Qwen Research License, which is specific to the Qwen model family.
It accepts images (URLs or base64), videos (paths or URLs), and text interleaved with visual content.
Access it through the gigarouter OpenAI-compatible endpoint using your API key. Send requests with image/video and text inputs.
We're benchmarking and onboarding Qwen2.5 VL 3B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.