Granite Vision 3.3 2B

ibm-granite/granite-vision-3.3-2b

published Jun 2025 · updated Apr 2026

Granite Vision 3.3 2B is a compact vision-language model designed for visual document understanding, enabling automated content extraction from tables, charts, infographics, plots, diagrams, and general image understanding.

est. price

~$0.626

/ 1k images · estimated, set at launch

API providers

downloads / mo

343.3K

license

apache-2.0

specs

Task	Image-to-Text (Visual Document Understanding & General Image QA)
Architecture	Vision Encoder: SigLIP2; LLM: Granite 3.1 2B Instruct; Two-layer MLP connector
Parameters	2.97B total
License	Apache 2.0

about this model

Granite-vision-3.3-2b is a vision-language model designed for visual document understanding, enabling automated content extraction from tables, charts, infographics, plots, diagrams, and general images. It is a compact model with approximately 2.97 billion total parameters, built by fine-tuning a 2-billion-parameter Granite large language model with a SigLIP2 vision encoder and a two-layer MLP connector.

Key Strengths

The model excels at document understanding tasks, achieving strong benchmark results compared to prior Granite Vision versions:

Benchmark	Granite-vision-3.3-2b	Granite-vision-3.2-2b
DocVQA	0.91	0.89
ChartQA	0.87	0.87
TextVQA	0.80	0.78
InfoVQA	0.68	0.64
OCRBench	0.79	0.77

Safety alignment has been improved across all measured dimensions. On the RTVLM benchmark, the model scores 8.0 (Politics), 8.1 (Racial), 7.5 (Jailbreak), and 8.0 (Mislead). On VLGuard, it scores 8.4 for unsafe images and 9.3 for safe images with unsafe instructions.

Experimental Capabilities

The model introduces three experimental features: image segmentation, doctags generation for parsing document images into structured text, and multi-page support for question answering across up to 8 consecutive document pages.

Architecture and Training

The model uses a SigLIP2 vision encoder, a two-layer MLP vision-language connector, and the granite-3.1-2b-instruct language model with 128k context length. Training data includes publicly available datasets, internally created synthetic data for document understanding, and high-quality sources such as Mammoth-12M and Bigdocs. The model is released under the Apache 2.0 license.

best for

·Extracting data from tables and charts in documents
·Answering questions about document content (e.g., invoices, reports)
·Optical character recognition in scanned documents
·Image segmentation and multi-page document QA (experimental)

FAQ

What input formats does the model accept?

English instructions and images in PNG or JPEG format.

What license is this model released under?

Apache 2.0, allowing both research and commercial use.

How does Granite Vision 3.3 2B compare to Granite Vision 3.2 2B?

It shows improvements on DocVQA, TextVQA, InfoVQA, OCRBench, and safety benchmarks, while maintaining similar performance on ChartQA and others.

What experimental capabilities does it have?

Image segmentation, doctags generation (structured text from documents), and multi-page support (up to 8 pages).

How can I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your gigarouter API key; format the request with an image URL or base64-encoded image and a text prompt.

not yet live

We're benchmarking and onboarding Granite Vision 3.3 2B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related image-to-text models

compare all →

blip-image-captioning-base

1.9M dl/mo

blip-image-captioning-large

trocr-small-handwritten

448.6K dl/mo