skip to content
gigarouter gigarouter
models / vision-language · coming soon

MinerU 2.5 1.2B

opendatalab/MinerU2.5-2509-1.2B

published Sep 2025 · updated Apr 2026

MinerU 2.5 1.2B is a vision-language model for efficient high-resolution document parsing.

est. price
~$0.235
/ 1k images · estimated, set at launch
API providers
0
downloads / mo
21.2K
license
agpl-3.0

specs

TaskDocument Parsing / OCR
ArchitectureVision-Language Model (VLM)
Parameters1.2B
LicenseNot specified

about this model

MinerU2.5 is a 1.2B-parameter vision-language model (VLM) for document parsing that achieves state-of-the-art recognition accuracy with high computational efficiency. It employs a coarse-to-fine, two-stage parsing strategy: first performing efficient global layout analysis on downsampled images to identify structural elements, then conducting fine-grained content recognition on native-resolution crops for dense text, complex formulas, and tables. This decoupled approach circumvents the computational overhead of processing high-resolution inputs while preserving fine-grained detail.

Key Strengths

  • Comprehensive layout analysis: Preserves non-body elements such as headers, footers, and page numbers, and uses a refined labeling schema for clearer representation of lists, references, and code blocks.
  • Formula parsing: Handles complex, lengthy mathematical formulae and accurately recognizes mixed-language (Chinese-English) equations.
  • Table parsing robustness: Effectively processes rotated tables, borderless tables, and tables with partial borders.

Benchmark Results

On the olmOCR-bench benchmark, MinerU2.5 achieves an overall score of 75.2, with 76.6 on Arxiv Math, 54.6 on Old Scans Math, and 84.9 on Table Tests. The model is supported by a large-scale, diverse data engine for both pretraining and fine-tuning, enabling robust performance across document types while maintaining low computational overhead.

MinerU2.5 model architecture diagram showing the two-stage coarse-to-fine parsing pipeline

Example document parsing output comparing MinerU2.5 results with ground truth

best for

FAQ

What is MinerU 2.5 1.2B best used for?

It is designed for efficient document parsing, including layout analysis, text recognition, formula parsing, and table extraction from high-resolution documents.

What is the model architecture and size?

It is a 1.2B parameter vision-language model (VLM) employing a coarse-to-fine two-stage parsing strategy.

How does it perform on benchmarks?

On the olmOCR-bench benchmark, it achieves an overall score of 75.2, with a table parsing score of 84.9.

What are the input and output formats?

Input is an image (document page); output is structured text with layout elements (paragraphs, formulas, tables). Via the gigarouter API, send the image as a base64 string or URL and receive JSON or text.

How can I call this model via the API?

Use the gigarouter OpenAI-compatible endpoint with your API key. Provide the image as a URL or base64 in a chat completion request.

not yet live

We're benchmarking and onboarding MinerU 2.5 1.2B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →