Question 1

What document intelligence tasks can Qianfan-OCR perform?

Accepted Answer

It supports document parsing (image-to-Markdown), layout analysis, table extraction, formula recognition, chart understanding, key information extraction, handwriting recognition, scene text recognition, and multilingual OCR in 192 languages.

Question 2

How does Qianfan-OCR compare to traditional multi-stage OCR pipelines?

Accepted Answer

It is a single end-to-end model that replaces chained detection, recognition, and LLM modules, achieving higher accuracy and lower latency on benchmarks like OmniDocBench v1.5 (93.12) and OCRBench (880).

Question 3

What is Layout-as-Thought and when should I use it?

Accepted Answer

Layout-as-Thought is an optional thinking phase triggered by special tokens where the model generates bounding boxes, element types, and reading order before the final output. Use it for complex or heterogeneous documents (e.g., exam papers, newspapers) to improve accuracy; disable it for simple layouts to reduce latency.

Question 4

What input and output formats does Qianfan-OCR support?

Accepted Answer

Input: images (PNG/JPG) with a text prompt. Output: Markdown, JSON, HTML (via prompt control). For key information extraction, output can be structured JSON.

Question 5

How can I call Qianfan-OCR via the gigarouter API?

Accepted Answer

Use the OpenAI-compatible endpoint provided by gigarouter. Pass an image URL or base64-encoded image, a text prompt, and your API key. The model returns the generated text.

Task	Document Intelligence / OCR
Architecture	Qianfan-ViT vision encoder + Qwen3-4B language model with cross-modal MLP adapter
Parameters	4B (3.6B non-embedding)
License	Apache 2.0

Qianfan-OCR

specs

about this model

Key Strengths

Benchmark Results

Architecture Overview

best for

FAQ

related vision-language models