Granite Vision 4.1 4B

ibm-granite/granite-vision-4.1-4b

published Apr 2026 · updated May 2026

Granite Vision 4.1 4B is a vision-language model that delivers frontier-level performance on structured document extraction tasks in a compact 4B parameter footprint.

est. price

~$0.626

/ 1k images · estimated, set at launch

API providers

downloads / mo

111K

license

apache-2.0

specs

Task	Structured document extraction (charts, tables, key-value pairs)
Architecture	Granite-4.1-3B LLM (3.4B) + 0.6B Vision Encoder and Projectors
Parameters	4B
License	Apache 2.0

about this model

Granite Vision 4.1 4B is a vision-language model (VLM) that delivers frontier-level performance on structured document extraction tasks — chart extraction, table extraction, and semantic key-value pair extraction — in a compact 4B parameter footprint. It is finetuned on top of Granite-4.1-3B, with a 3.4B LLM and 0.6B vision encoder and projectors. The model is developed by IBM Research and released under the Apache 2.0 license.

Supported Extraction Tasks

The model supports specialized extraction tasks activated by simple task tags in the user message, which the chat template automatically expands into full prompts:

Chart extraction: <chart2csv> (CSV table), <chart2code> (Python code), <chart2summary> (natural-language description)


  Table extraction: <tables_json> (structured JSON), <tables_html> (HTML markup), <tables_otsl> (OTSL markup with cell/merge tags)
  Key-Value Pair (KVP) extraction: Schema-based extraction returning JSON with nested dictionaries and arrays



Benchmark Performance
Granite Vision 4.1 4B provides a lightweight alternative to frontier models on structured document extraction benchmarks, delivering comparable performance at a fraction of the parameter count.



Chart Extraction
Evaluated on the human-verified test set from ChartNet (1.5 million chart samples, 24 chart types, 6 plotting libraries), using LLM-as-a-judge (GPT-4o) scoring on Chart2CSV and Chart2Summary tasks.




Table Extraction
Evaluated on a unified suite spanning TableVQA-Extract, OmniDocBench-tables, and PubTablesV2, using TEDS (Tree-Edit Distance-based Similarity) for structural and content similarity. Results are reported separately for cropped-table and full-page settings.





Key-Value Pair Extraction
On the VAREX benchmark (1,777 U.S. government forms, 21,084 evaluation fields), Granite Vision 4.1 4B achieves 94.2% exact-match accuracy (zero-shot, image modality), competitive with much larger frontier models.

Methodology
The model is trained on ChartNet, a million-scale multimodal dataset described in the paper ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding (accepted at CVPR 2026). ChartNet uses a code-guided synthesis pipeline to generate 1.5 million chart samples with five aligned components: plotting code, rendered chart image, data table, natural language summary, and QA with reasoning. The dataset includes specialized subsets for human-annotated data, real-world data, safety, and grounding, with a rigorous quality-filtering pipeline ensuring visual fidelity and semantic accuracy.

The model integrates seamlessly with Docling for enhanced document processing pipelines with deep visual understanding capabilities.

best for

·Converting charts to CSV, Python code, or natural-language summaries
·Extracting tables from document images into JSON, HTML, or OTSL
·Extracting semantic key-value pairs (e.g., invoice fields) using a JSON schema

FAQ

What tasks does Granite Vision 4.1 4B support?

It supports chart extraction (CSV, code, summary), table extraction (JSON, HTML, OTSL), and semantic key-value pair extraction, all via simple task tags.

How do I call this model via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key, sending a chat completion request with an image and a task tag prompt.

What is the license for Granite Vision 4.1 4B?

It is released under Apache 2.0, allowing free use, modification, and distribution.

How does this 4B parameter model compare to larger frontier models?

It delivers comparable performance on structured document extraction benchmarks while being much smaller and faster, making it a lightweight alternative.

Can Granite Vision 4.1 4B be integrated with Docling?

Yes, it integrates seamlessly with Docling to enhance document processing pipelines with deep visual understanding capabilities.

not yet live

We're benchmarking and onboarding Granite Vision 4.1 4B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →

Qwen2.5-VL-7B-Instruct

9.8M dl/mo

Qwen3.6-35B-A3B-FP8

6.2M dl/mo

Qwen2.5-VL-3B-Instruct

5.3M dl/mo

gemma-4-26B-A4B-it-AWQ-4bit