PP-OCRv6 Medium Det

PaddlePaddle/PP-OCRv6_medium_det

published Jun 2026 · updated Jun 2026

PP-OCRv6 Medium Det is an image-to-text model that detects and localizes text regions in images using a lightweight MetaFormer-based architecture.

status

coming soon

API providers

downloads / mo

89K

license

apache-2.0

specs

Task	Text Detection
Architecture	MetaFormer (LCNetV4 backbone + RepLKFPN neck)
Parameters	15.5M
License	Apache 2.0

about this model

PP-OCRv6_medium_det is a lightweight text detection model from the PP-OCRv6 family, designed for accurate and efficient localization of text in images. It is built on a unified MetaFormer-style architecture with a LCNetV4 backbone and RepLKFPN feature pyramid neck, using structural reparameterization to decouple spatial and channel mixing. With 15.5 million parameters, it targets server and cloud deployment scenarios.

Benchmark Performance

On the PP-OCRv6 in-house benchmark, the medium tier achieves an average detection Hmean of 86.2%, surpassing PP-OCRv5_server (81.6%) by +4.6% and PP-OCRv5_mobile (75.2%) by +11.0%. The model dramatically outperforms billion-scale vision-language models on the same detection benchmark: Gemini-3.1-Pro (46.8%), GPT-5.5 (45.6%), Qwen3-VL-235B (38.3%), Kimi-K2.6 (12.8%), and MiniMax-M3 (12.0%). Recognition accuracy for the full OCR system is 83.2%.

Per-Category Detection Hmean

Category	Hmean (%)
Handwritten CN	83.7
Handwritten EN	84.0
Printed CN	95.1
Printed EN	93.7
Traditional Chinese	86.3
Ancient Text	80.2
Japanese	84.3
Blur	94.1
Emoji	99.6
Warp	88.6
Pinyin	74.0
Artistic	69.0
Table	96.8
Rotation	93.8
Industrial	73.3
General	82.8

Architecture and Deployment

The model shares block primitives across medium, small, and tiny tiers, with task-specific stride configurations. The tiny tier achieves 3.9× faster inference than PP-OCRv5_mobile on Intel Xeon CPU while maintaining comparable accuracy. The medium_det variant is licensed under Apache 2.0. A detailed paper is available at arXiv:2606.13108.

best for

·Extracting text from scanned documents
·Detecting text in natural scene images
·OCR for multi-lingual documents (Chinese, English, Japanese, etc.)

FAQ

What is PP-OCRv6 Medium Det best used for?

It excels at bounding-box-level text detection in images, supporting printed, handwritten, and multi-lingual text across diverse scenarios.

How does it compare to billion-scale VLMs on detection tasks?

PP-OCRv6 Medium Det (86.2% Hmean) dramatically surpasses Gemini-3.1-Pro (46.8%), GPT-5.5 (45.6%), and Qwen3-VL-235B (38.3%) with only 15.5M parameters.

What are the input and output formats?

Input: image (JPEG/PNG). Output: JSON array of bounding boxes with confidence scores and text (if combined with recognition).

How can I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key; pass the image as a base64-encoded string or URL in the request.

What is the license for PP-OCRv6 Medium Det?

It is released under the Apache 2.0 license, allowing commercial use, modification, and distribution with attribution.

not yet live

We're benchmarking and onboarding PP-OCRv6 Medium Det as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related image-to-text models

compare all →

blip-image-captioning-base

1.9M dl/mo

blip-image-captioning-large

trocr-small-handwritten

448.6K dl/mo