Granite Vision 3.3 2B
ibm-granite/granite-vision-3.3-2b
published Jun 2025 · updated Apr 2026
Granite Vision 3.3 2B is a compact vision-language model designed for visual document understanding, enabling automated content extraction from tables, charts, infographics, plots, diagrams, and general image understanding.
specs
| Task | Image-to-Text (Visual Document Understanding & General Image QA) |
| Architecture | Vision Encoder: SigLIP2; LLM: Granite 3.1 2B Instruct; Two-layer MLP connector |
| Parameters | 2.97B total |
| License | Apache 2.0 |
about this model
Key Strengths
The model excels at document understanding tasks, achieving strong benchmark results compared to prior Granite Vision versions:
| Benchmark | Granite-vision-3.3-2b | Granite-vision-3.2-2b |
|---|---|---|
| DocVQA | 0.91 | 0.89 |
| ChartQA | 0.87 | 0.87 |
| TextVQA | 0.80 | 0.78 |
| InfoVQA | 0.68 | 0.64 |
| OCRBench | 0.79 | 0.77 |
Safety alignment has been improved across all measured dimensions. On the RTVLM benchmark, the model scores 8.0 (Politics), 8.1 (Racial), 7.5 (Jailbreak), and 8.0 (Mislead). On VLGuard, it scores 8.4 for unsafe images and 9.3 for safe images with unsafe instructions.
Experimental Capabilities
The model introduces three experimental features: image segmentation, doctags generation for parsing document images into structured text, and multi-page support for question answering across up to 8 consecutive document pages.
Architecture and Training
The model uses a SigLIP2 vision encoder, a two-layer MLP vision-language connector, and the granite-3.1-2b-instruct language model with 128k context length. Training data includes publicly available datasets, internally created synthetic data for document understanding, and high-quality sources such as Mammoth-12M and Bigdocs. The model is released under the Apache 2.0 license.
best for
- ·Extracting data from tables and charts in documents
- ·Answering questions about document content (e.g., invoices, reports)
- ·Optical character recognition in scanned documents
- ·Image segmentation and multi-page document QA (experimental)
FAQ
English instructions and images in PNG or JPEG format.
Apache 2.0, allowing both research and commercial use.
It shows improvements on DocVQA, TextVQA, InfoVQA, OCRBench, and safety benchmarks, while maintaining similar performance on ChartQA and others.
Image segmentation, doctags generation (structured text from documents), and multi-page support (up to 8 pages).
Use the OpenAI-compatible endpoint with your gigarouter API key; format the request with an image URL or base64-encoded image and a text prompt.
We're benchmarking and onboarding Granite Vision 3.3 2B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.