TrOCR Base Printed
microsoft/trocr-base-printed
published Mar 2022 · updated May 2024
TrOCR Base Printed is an image-to-text model for optical character recognition (OCR) on printed text-line images using a Transformer encoder-decoder architecture.
specs
| Task | Optical Character Recognition (OCR) |
| Architecture | Encoder-decoder Transformer (BEiT encoder, RoBERTa decoder) |
| Parameters | 334M |
about this model
microsoft/trocr-base-printed recognizes text in printed text-line images with a pure Transformer encoder-decoder: it uses no CNN backbone and no RNN components.
Architecture and Training
The model uses a Vision Transformer (initialized from BEiT) as the encoder and a text Transformer (initialized from RoBERTa) as the decoder. The encoder splits each input image into 16×16 patches, linearly embeds them, and adds absolute position embeddings; the decoder then generates wordpiece tokens autoregressively. TrOCR is trained end-to-end in two stages: it is first pre-trained on large-scale synthetic data, then fine-tuned on the SROIE dataset for printed text recognition. The approach was introduced by Li et al. (2021) and published at AAAI 2023. This checkpoint has 334 million parameters.
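The patch step described above can be made concrete with a little arithmetic. The sketch below is illustrative only: the 384×384 RGB input size is an assumption (this card does not state the input resolution), the helper names are hypothetical, and the count excludes any special tokens the encoder may prepend.

```python
def patch_grid(height: int, width: int, patch: int = 16) -> tuple[int, int]:
    """Return (rows, cols) of non-overlapping patch x patch tiles."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return height // patch, width // patch


def sequence_length(height: int, width: int, patch: int = 16) -> int:
    """Number of patch tokens the encoder sees for one image."""
    rows, cols = patch_grid(height, width, patch)
    return rows * cols


def patch_vector_size(patch: int = 16, channels: int = 3) -> int:
    """Length of one flattened patch before the linear embedding."""
    return patch * patch * channels


# Assumed input size: 384x384 RGB (an assumption, not stated in this card).
print(patch_grid(384, 384))       # (24, 24)
print(sequence_length(384, 384))  # 576
print(patch_vector_size())        # 768
```

Under these assumptions, each image becomes a sequence of 576 patch vectors of length 768, which the linear embedding projects to the encoder's hidden size.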
Benchmark Performance
On the SROIE dataset, TrOCR-Base achieves an F1 score of 96.34 (96.60 for the Large variant). TrOCR-Base also reports the following word accuracy on scene text recognition benchmarks:
| Dataset | Word Accuracy (%) |
|---|---|
| IIIT5K-3000 | 93.4 |
| SVT-647 | 95.2 |
| ICDAR2013-857 | 98.4 |
| ICDAR2013-1015 | 97.4 |
| ICDAR2015-1811 | 86.9 |
| ICDAR2015-2077 | 81.2 |
| SVTP-645 | 92.1 |
| CT80-288 | 90.6 |
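Word accuracy in the table above is the percentage of images whose predicted word matches the ground truth exactly. A minimal sketch of that metric follows; the function name and the sample strings are hypothetical, and the case-insensitive default is an assumption, since evaluation protocols differ between benchmarks.

```python
def word_accuracy(predictions, references, case_sensitive=False):
    """Percentage of predictions that exactly match their reference word."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one sample is required")
    correct = 0
    for pred, ref in zip(predictions, references):
        if not case_sensitive:
            pred, ref = pred.lower(), ref.lower()
        if pred == ref:
            correct += 1
    return 100.0 * correct / len(references)


# Made-up samples for illustration only.
preds = ["coffee", "EXIT", "stoq", "open"]
refs = ["coffee", "exit", "stop", "open"]
print(word_accuracy(preds, refs))                       # 75.0
print(word_accuracy(preds, refs, case_sensitive=True))  # 50.0
```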
On the IAM handwritten dataset, TrOCR-Base achieves a cased (case-sensitive) character error rate (CER) of 3.42 when fine-tuned for handwriting, which shows that the architecture extends beyond printed text.
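CER is commonly computed as the character-level edit distance between prediction and reference, divided by the number of reference characters and expressed as a percentage. The sketch below follows that common definition; it is not the evaluation script used for the reported figure, and the function names and sample strings are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(predictions, references) -> float:
    """Cased character error rate in percent over a set of text lines."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references must contain at least one character")
    total_edits = sum(edit_distance(p, r)
                      for p, r in zip(predictions, references))
    return 100.0 * total_edits / total_chars


# Made-up samples: one digit/letter confusion and one casing error.
preds = ["Total 12.5O", "thank you"]
refs = ["Total 12.50", "Thank you"]
print(cer(preds, refs))  # 10.0
```

Because the comparison is case-sensitive, the casing error in the second sample counts as a full character error.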
Key Strengths
- End-to-end Transformer architecture eliminates the need for separate CNN, RNN, and language model post-processing steps.
- Pre-training on synthetic data followed by targeted fine-tuning enables strong generalization across printed, handwritten, and scene text domains.
- Competitive benchmark results against state-of-the-art models at the time of publication.
best for
- Extracting printed text from scanned documents
- Recognizing text on receipt images (SROIE)
- OCR for single text-line images
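For the single text-line use cases listed above, a minimal local inference sketch might look like the following. It assumes the checkpoint is loaded through the Hugging Face `transformers` library together with Pillow and PyTorch, none of which this page mentions; `recognize_line`, the `max_new_tokens` default, and the example file name are hypothetical. It is not the hosted API described on this page.

```python
MODEL_ID = "microsoft/trocr-base-printed"


def recognize_line(image_path: str, max_new_tokens: int = 64) -> str:
    """Run OCR on one cropped text-line image and return the decoded text."""
    # Imported lazily so this module loads without the heavy dependencies.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    # Downloads the weights on first use (assumes network access).
    processor = TrOCRProcessor.from_pretrained(MODEL_ID)
    model = VisionEncoderDecoderModel.from_pretrained(MODEL_ID)

    # The processor resizes and normalizes the image into a pixel tensor.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Autoregressive decoding, then token ids back to a string.
    generated_ids = model.generate(pixel_values, max_new_tokens=max_new_tokens)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]


# Example call (hypothetical file name):
# text = recognize_line("receipt_line.png")
```

The model expects one text line per image, so a full page or receipt would first need to be cropped into lines by a separate text detection step.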
FAQ
It is optimized for optical character recognition on printed text-line images, especially for scanned documents and receipts.
The model has 334 million parameters.
It uses an encoder-decoder Transformer with a BEiT image encoder and a RoBERTa text decoder.
It takes single text-line images (e.g., PNG, JPEG) and outputs recognized text.
Use the gigarouter OpenAI-compatible endpoint with your API key.
We're benchmarking and onboarding TrOCR Base Printed as a hosted, OpenAI-compatible API.