InternVL2 2B
OpenGVLab/InternVL2-2B
published Jun 2024 · updated Mar 2025
InternVL2 2B is a multimodal large language model that integrates a vision encoder and language model to perform visual question answering, document understanding, OCR, and multimodal reasoning.
specs
| Task | Visual Language Model (Multimodal LLM) |
| Architecture | InternViT-300M-448px vision encoder + MLP projector + internlm2-chat-1_8b language model |
| Parameters | 2.2B total |
about this model
InternVL2-2B is a vision-language model (VLM) that performs multimodal understanding tasks including document and chart comprehension, infographics QA, scene text recognition, OCR, scientific and mathematical reasoning, and video analysis. It is part of the InternVL 2.0 series, which consists of instruction-tuned models ranging from 1B to 108B parameters. InternVL2-2B uses a vision encoder (InternViT-300M-448px), an MLP projector, and the internlm2-chat-1_8b language model, totaling 2.2B parameters.
Key Strengths
- Strong performance on OCR and document understanding: achieves 86.9 on DocVQA, 76.2 on ChartQA, 58.9 on InfoVQA, and 73.4 on TextVQA.
- Top-tier OCRBench score of 784, significantly outperforming Mini-InternVL-2B-1-5 (654) and PaliGemma-3B (614).
- Competitive video understanding: scores 60.2 on MVBench and 45.0 on Video-MME (without subtitles), matching or exceeding models with larger parameter counts.
- Supports dynamic resolution with up to 12 tiles of 448×448 during training and up to 40 tiles (4K resolution) during testing, enabling high-detail image analysis.
- Trained with an 8k context window and data including long texts, multiple images, and videos, improving multi-image and video handling over InternVL 1.5.
Benchmark Performance
The table below compares InternVL2-2B against similarly sized models on key image benchmarks:
| Benchmark | PaliGemma-3B | Mini-InternVL-2B-1-5 | InternVL2-2B |
|---|---|---|---|
| DocVQA | - | 85.0 | 86.9 |
| ChartQA | - | 74.8 | 76.2 |
| InfoVQA | - | 55.4 | 58.9 |
| TextVQA | 68.1 | 70.5 | 73.4 |
| OCRBench | 614 | 654 | 784 |
| MME | 1686.1 | 1901.5 | 1876.8 |
| MMBench-EN | 71.0 | 70.9 | 73.2 |
| MathVista | 28.7 | 41.1 | 46.3 |
| OpenCompass | 46.6 | 49.8 | 54.0 |
On video benchmarks, InternVL2-2B scores 60.2 on MVBench and 45.0 on Video-MME (without subtitles), competitive with models twice its size. For grounding tasks, it achieves an average of 77.7 across RefCOCO variants.

best for
- ·Document and chart comprehension (DocVQA, ChartQA)
- ·Scene text recognition and OCR (OCRBench, TextVQA)
- ·Scientific and mathematical problem solving (MathVista)
- ·Multimodal reasoning with images and videos
FAQ
It excels at document and chart understanding, OCR, scientific reasoning, and multimodal question answering with both images and videos.
It supports an 8k context window, enabling processing of long text inputs alongside images.
It achieves 90% of the performance of much larger models while using only 5% of the parameters, as shown in the Mini-InternVL paper.
It accepts images and text; you provide an image and a prompt, and it returns generated text. The gigarouter API uses the OpenAI-compatible chat completions format.
Use the OpenAI-compatible endpoint with your API key, setting the model name to 'internvl2-2b' and sending a chat completion request with image URLs or base64-encoded images.
We're benchmarking and onboarding InternVL2 2B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.