InternVL2 2B

OpenGVLab/InternVL2-2B

published Jun 2024 · updated Mar 2025

InternVL2 2B is a multimodal large language model that integrates a vision encoder and language model to perform visual question answering, document understanding, OCR, and multimodal reasoning.

est. price

~$0.626

/ 1k images · estimated, set at launch

API providers

downloads / mo

1.5M

license

mit

specs

Task	Visual Language Model (Multimodal LLM)
Architecture	InternViT-300M-448px vision encoder + MLP projector + internlm2-chat-1_8b language model
Parameters	2.2B total

about this model

InternVL2-2B is a vision-language model (VLM) that performs multimodal understanding tasks including document and chart comprehension, infographics QA, scene text recognition, OCR, scientific and mathematical reasoning, and video analysis. It is part of the InternVL 2.0 series, which consists of instruction-tuned models ranging from 1B to 108B parameters. InternVL2-2B uses a vision encoder (InternViT-300M-448px), an MLP projector, and the internlm2-chat-1_8b language model, totaling 2.2B parameters.

Key Strengths

Strong performance on OCR and document understanding: achieves 86.9 on DocVQA, 76.2 on ChartQA, 58.9 on InfoVQA, and 73.4 on TextVQA.
Top-tier OCRBench score of 784, significantly outperforming Mini-InternVL-2B-1-5 (654) and PaliGemma-3B (614).
Competitive video understanding: scores 60.2 on MVBench and 45.0 on Video-MME (without subtitles), matching or exceeding models with larger parameter counts.
Supports dynamic resolution with up to 12 tiles of 448×448 during training and up to 40 tiles (4K resolution) during testing, enabling high-detail image analysis.
Trained with an 8k context window and data including long texts, multiple images, and videos, improving multi-image and video handling over InternVL 1.5.

Benchmark Performance

The table below compares InternVL2-2B against similarly sized models on key image benchmarks:

Benchmark	PaliGemma-3B	Mini-InternVL-2B-1-5	InternVL2-2B
DocVQA	-	85.0	86.9
ChartQA	-	74.8	76.2
InfoVQA	-	55.4	58.9
TextVQA	68.1	70.5	73.4
OCRBench	614	654	784
MME	1686.1	1901.5	1876.8
MMBench-EN	71.0	70.9	73.2
MathVista	28.7	41.1	46.3
OpenCompass	46.6	49.8	54.0

On video benchmarks, InternVL2-2B scores 60.2 on MVBench and 45.0 on Video-MME (without subtitles), competitive with models twice its size. For grounding tasks, it achieves an average of 77.7 across RefCOCO variants.

InternVL2 model architecture diagram showing the vision encoder, MLP projector, and language model components

best for

·Document and chart comprehension (DocVQA, ChartQA)
·Scene text recognition and OCR (OCRBench, TextVQA)
·Scientific and mathematical problem solving (MathVista)
·Multimodal reasoning with images and videos

FAQ

What is InternVL2 2B best for?

It excels at document and chart understanding, OCR, scientific reasoning, and multimodal question answering with both images and videos.

What is the context window of InternVL2 2B?

It supports an 8k context window, enabling processing of long text inputs alongside images.

How does InternVL2 2B compare to larger models in terms of performance?

It achieves 90% of the performance of much larger models while using only 5% of the parameters, as shown in the Mini-InternVL paper.

What input formats does InternVL2 2B accept?

It accepts images and text; you provide an image and a prompt, and it returns generated text. The gigarouter API uses the OpenAI-compatible chat completions format.

How can I call InternVL2 2B via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key, setting the model name to 'internvl2-2b' and sending a chat completion request with image URLs or base64-encoded images.

not yet live

We're benchmarking and onboarding InternVL2 2B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →

Qwen2.5-VL-7B-Instruct

9.8M dl/mo

Qwen3.6-35B-A3B-FP8

6.2M dl/mo

Qwen2.5-VL-3B-Instruct

5.3M dl/mo

gemma-4-26B-A4B-it-AWQ-4bit