skip to content
gigarouter gigarouter
models / vision-language · coming soon

Qianfan-OCR

baidu/Qianfan-OCR

published Mar 2026 · updated Apr 2026

Qianfan-OCR is a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture.

est. price
~$1.341
/ 1k images · estimated, set at launch
API providers
0
downloads / mo
258.6K
license
apache-2.0

specs

TaskDocument Intelligence / OCR
ArchitectureQianfan-ViT vision encoder + Qwen3-4B language model with cross-modal MLP adapter
Parameters4B (3.6B non-embedding)
LicenseApache 2.0

about this model

Qianfan-OCR is a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports prompt-driven tasks including table extraction, chart understanding, document question answering, and key information extraction.

Key Strengths

The model introduces Layout-as-Thought, an optional thinking phase triggered by special ⟨think⟩ tokens. This mechanism generates structured layout representations — bounding boxes, element types, and reading order — before producing final outputs, recovering layout analysis capability and improving accuracy on complex documents. Qianfan-OCR supports 192 languages and achieves an inference throughput of 1.024 pages per second on a single A100 GPU with W8A8 quantization.

Benchmark Results

Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 with an overall score of 93.12, surpassing DeepSeek-OCR-v2 (91.09), Gemini-3 Pro (90.33), and Qwen3-VL-235B (89.15). It also achieves the first position among end-to-end models on OlmOCR Bench (79.8). On OCRBench the model scores 880, the highest overall across all models. For key information extraction, it attains an average score of 87.9 across five public KIE benchmarks, outperforming Gemini-3.1-Pro (79.2), Seed-2.0, and Qwen3-VL-235B-A22B (84.2). The model was trained on 1,024 Kunlun P800 chips processing 2.85 trillion tokens across four training stages.

Architecture Overview

The model combines a Qianfan-ViT vision encoder with AnyResolution design (up to 4K, max 4,096 visual tokens), a Qwen3-4B language backbone (3.6B non-embedding, 32K context extendable to 131K), and a two-layer MLP cross-modal adapter. All components are optimized for production document intelligence workloads.

best for

FAQ

What document intelligence tasks can Qianfan-OCR perform?

It supports document parsing (image-to-Markdown), layout analysis, table extraction, formula recognition, chart understanding, key information extraction, handwriting recognition, scene text recognition, and multilingual OCR in 192 languages.

How does Qianfan-OCR compare to traditional multi-stage OCR pipelines?

It is a single end-to-end model that replaces chained detection, recognition, and LLM modules, achieving higher accuracy and lower latency on benchmarks like OmniDocBench v1.5 (93.12) and OCRBench (880).

What is Layout-as-Thought and when should I use it?

Layout-as-Thought is an optional thinking phase triggered by special tokens where the model generates bounding boxes, element types, and reading order before the final output. Use it for complex or heterogeneous documents (e.g., exam papers, newspapers) to improve accuracy; disable it for simple layouts to reduce latency.

What input and output formats does Qianfan-OCR support?

Input: images (PNG/JPG) with a text prompt. Output: Markdown, JSON, HTML (via prompt control). For key information extraction, output can be structured JSON.

How can I call Qianfan-OCR via the gigarouter API?

Use the OpenAI-compatible endpoint provided by gigarouter. Pass an image URL or base64-encoded image, a text prompt, and your API key. The model returns the generated text.

not yet live

We're benchmarking and onboarding Qianfan-OCR as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →