Dots OCR

rednote-hilab/dots.ocr

published Jul 2025 · updated Oct 2025

Dots OCR is a VLM model that unifies layout detection and content recognition for multilingual document parsing using a compact 1.7B-parameter LLM.

est. price

~$0.626

/ 1k images · estimated, set at launch

API providers

downloads / mo

278.6K

license

mit

specs

Task	Document Layout Parsing & OCR
Architecture	Vision-Language Model (VLM) based on a 1.7B LLM
Parameters	1.7B

about this model

dots.ocr is a vision-language model (VLM) for multilingual document layout parsing that unifies layout detection and content recognition within a single 1.7B-parameter LLM foundation, outputting structured JSON with elements sorted in human reading order. It achieves state-of-the-art performance on the OmniDocBench benchmark, with the lowest overall Edit distance scores (0.125 EN, 0.160 ZH), the highest table TEDS scores (88.6 EN, 89.0 ZH), and the best reading order Edit scores (0.040 EN, 0.067 ZH) among all compared methods, including general VLMs like Gemini2.5-Pro and Doubao-1.5. Its formula recognition results (0.329 EN, 0.416 ZH Edit) are competitive with much larger models. On the new XDocParse benchmark spanning 126 languages, dots.ocr achieves state-of-the-art performance with approximately 10% relative improvement over prior methods. The model also demonstrates robust parsing for low-resource languages, achieving decisive advantages on an in-house multilingual benchmark. Its unified architecture allows task switching via prompt changes, providing faster inference than models built on larger foundations. Additional evaluation on olmOCR-Bench yields an overall score of 79.1, with 82.1 on Arxiv Math, 64.2 on Old Scans Math, and 88.3 on Table Tests.

Benchmark Results

End-to-end evaluation on OmniDocBench (lower Edit is better, higher TEDS is better):

Method	Overall (EN)	Overall (ZH)	Text (EN)	Text (ZH)	Formula (EN)	Formula (ZH)	Table (EN)	Table (ZH)	Read Order (EN)	Read Order (ZH)
dots.ocr	0.125	0.160	0.032	0.066	0.329	0.416	88.6	89.0	0.040	0.067

Performance comparison chart of dots.ocr versus competing models on OmniDocBench and multilingual benchmarks.

best for

·Parsing multi-column documents with tables, formulas, and figures in multiple languages
·Extracting structured data from scanned PDFs with correct reading order
·Performing OCR on low-resource language documents

FAQ

What is Dots OCR best for?

It is best for multilingual document layout parsing, unifying detection and recognition of text, tables, formulas, and reading order in a single model.

How does Dots OCR compare to larger models like Gemini 2.5 Pro?

Despite its compact 1.7B size, Dots OCR achieves state-of-the-art performance on OmniDocBench and XDocParse, often matching or exceeding much larger models.

What input and output format does the model accept?

Input is an image and a text prompt specifying the desired layout output; output is a JSON object with bounding boxes, categories, and formatted text (Markdown, HTML, or LaTeX).

How can I call Dots OCR via the API?

Use the GigaRouter OpenAI-compatible endpoint with your API key to send images and prompts, receiving structured JSON responses.

Does Dots OCR support multiple languages?

Yes, it supports 126+ languages including low-resource ones, with robust parsing capabilities demonstrated on an in-house multilingual benchmark.

not yet live

We're benchmarking and onboarding Dots OCR as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →

Qwen2.5-VL-7B-Instruct

9.8M dl/mo

Qwen3.6-35B-A3B-FP8

6.2M dl/mo

Qwen2.5-VL-3B-Instruct

5.3M dl/mo

gemma-4-26B-A4B-it-AWQ-4bit