TrOCR Base Printed

microsoft/trocr-base-printed

published Mar 2022 · updated May 2024

TrOCR Base Printed is an image-to-text model for optical character recognition (OCR) on printed text-line images using a Transformer encoder-decoder architecture.

est. price

~$0.094 / 1k images · estimated, set at launch

downloads / mo

251.5K

specs

Task	Optical Character Recognition (OCR)
Architecture	Encoder-decoder Transformer (BEiT encoder, RoBERTa decoder)
Parameters	334M

about this model

microsoft/trocr-base-printed is an image-to-text model that performs optical character recognition (OCR) on printed text-line images using a pure Transformer encoder-decoder architecture, without CNN or RNN components.

Architecture and Training

The model uses a Vision Transformer encoder initialized from BEiT and a text Transformer decoder initialized from RoBERTa. The input image is split into 16×16 patches, each patch is linearly embedded and combined with an absolute position embedding, and the decoder then generates wordpiece tokens autoregressively. Training has two stages: pre-training on large-scale synthetic data, then fine-tuning on the SROIE dataset for printed text recognition. The model has 334 million parameters. The approach was introduced in Li et al. (2021) and published at AAAI 2023.
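To make the patch step concrete, here is a minimal sketch of the arithmetic behind splitting an image into 16×16 patches. The 384×384 RGB input size is an assumption that is not stated on this page, and `patch_grid` is a hypothetical helper name used only for illustration. The count excludes any special tokens the encoder may prepend.

```python
from typing import Tuple

PATCH_SIZE = 16  # patch side length described above


def patch_grid(height: int, width: int, patch: int = PATCH_SIZE) -> Tuple[int, int]:
    """Return (rows, cols) of non-overlapping patches that tile the image."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return height // patch, width // patch


# Assumption: a 384x384 RGB input (not stated on this page).
rows, cols = patch_grid(384, 384)
num_patches = rows * cols                       # patch sequence length seen by the encoder
values_per_patch = PATCH_SIZE * PATCH_SIZE * 3  # pixel values flattened before the linear embedding
print(rows, cols, num_patches, values_per_patch)
```

Under that assumption the encoder sees a sequence of 576 patch embeddings, each projected from 768 raw pixel values.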

Benchmark Performance

On the SROIE dataset, TrOCR-Base achieves an F1 score of 96.34 (96.60 for the Large variant). Reported word accuracy on scene text recognition benchmarks:

Dataset	Word Accuracy (%)
IIIT5K-3000	93.4
SVT-647	95.2
ICDAR2013-857	98.4
ICDAR2013-1015	97.4
ICDAR2015-1811	86.9
ICDAR2015-2077	81.2
SVTP-645	92.1
CT80-288	90.6
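The two metrics above can be sketched as follows. This is an illustrative implementation, not the official evaluation script for either benchmark: it assumes exact string matching for word accuracy (benchmark protocols may normalise case or punctuation first) and takes pre-counted matches for F1. Both function names are hypothetical.

```python
from typing import Sequence


def word_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions that match their reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one reference")
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


def f1_score(num_correct: int, num_predicted: int, num_reference: int) -> float:
    """Harmonic mean of precision and recall computed from match counts."""
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_reference if num_reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, `word_accuracy(["abc", "de"], ["abc", "dx"])` gives 50.0, since one of the two words matches.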

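On the IAM handwritten dataset, TrOCR-Base achieves a cased character error rate (CER) of 3.42 (lower is better), which shows that the architecture extends beyond printed text. This checkpoint itself is fine-tuned on SROIE, so it targets printed text.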
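As a rough illustration of how a cased CER is computed, the sketch below divides the Levenshtein edit distance between prediction and reference by the reference length. It is a simplified per-string version and not the official IAM evaluation code. The function names are hypothetical, and "cased" here means that a case difference counts as an error.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(prediction: str, reference: str) -> float:
    """Character error rate in percent, case-sensitive."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(prediction, reference) / len(reference)
```

With this definition, `cer("Hello", "hello")` is 20.0, because the single case mismatch is one error over five reference characters.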

Key Strengths

The end-to-end Transformer architecture removes the need for separate CNN, RNN, and language-model post-processing steps.
Pre-training on synthetic data followed by task-specific fine-tuning lets the same architecture be adapted to printed, handwritten, and scene text.
Benchmark results were competitive with state-of-the-art models at the time of publication.

best for

· Extracting printed text from scanned documents
· Recognizing text on receipt images (SROIE)
· OCR for single text-line images

FAQ

What is TrOCR Base Printed best for?

It is optimized for optical character recognition on printed text-line images, especially for scanned documents and receipts.

How many parameters does it have?

334 million parameters.

What architecture does it use?

It uses an encoder-decoder Transformer with a BEiT image encoder and a RoBERTa text decoder.

What input format does it require?

It takes an image of a single text line (e.g., PNG or JPEG) and outputs the recognized text.
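For local use, a minimal sketch with the Hugging Face `transformers` library might look like the following. It assumes `transformers`, `torch`, and `Pillow` are installed and that the checkpoint can be downloaded. The function name `recognize_line` is hypothetical, and this runs the model locally rather than through any hosted API.

```python
MODEL_ID = "microsoft/trocr-base-printed"


def recognize_line(image_path: str) -> str:
    """Return the text recognized in a single text-line image."""
    # Imported lazily so this module loads without the heavy dependencies.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(MODEL_ID)
    model = VisionEncoderDecoderModel.from_pretrained(MODEL_ID)

    # The processor expects an RGB image and handles resizing and normalization.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Autoregressive decoding into token ids, then back to a string.
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

A call such as `recognize_line("line.png")` (the path is a placeholder) returns the decoded string. For many images, loading the processor and model once outside the function avoids repeated initialization.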

How can I call this model via API?

Once it is hosted, use the gigarouter OpenAI-compatible endpoint with your API key. The hosted API is not yet live.

not yet live

We're benchmarking and onboarding TrOCR Base Printed as a hosted, OpenAI-compatible API.

related image-to-text models

blip-image-captioning-base

1.9M dl/mo

blip-image-captioning-large

trocr-small-handwritten

448.6K dl/mo