TrOCR Base Stage 1

microsoft/trocr-base-stage1

published Mar 2022 · updated May 2024

TrOCR Base Stage 1 is an image-to-text model for optical character recognition using a Transformer encoder-decoder architecture.

est. price

~$0.094

/ 1k images · estimated, set at launch

API providers

downloads / mo

149K

specs

Task	Image-to-Text (Optical Character Recognition)
Architecture	Vision Encoder-Decoder Transformer
Parameters	334 million
Patch Size	16x16

about this model

microsoft/trocr-base-stage1 is an image-to-text (optical character recognition) model that uses a Transformer encoder-decoder architecture for end-to-end text recognition. The encoder, initialized from BEiT, processes images as 16×16 patches; the decoder, initialized from RoBERTa, generates wordpiece-level tokens autoregressively. This pre-trained-only checkpoint (334M parameters) was introduced in the paper “TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models” (Li et al., AAAI 2023).

Architecture and Pre-training

The model treats OCR as a sequence-to-sequence task without requiring separate language model post-processing. It was pre-trained on large-scale synthetic data and is designed to be fine-tuned on specific printed, handwritten, or scene text recognition datasets. Pre-training uses an input resolution of 384×384, a learning rate of 2e-5, weight decay 0.0001, warmup updates 500, and a batch size of 8.

Key Benchmarks

When fine-tuned for downstream tasks, the TrOCR-Base model achieves the following results:

IAM Handwriting: Cased Character Error Rate of 3.42
SROIE Printed Receipts: F1 score of 96.34
Scene Text Recognition (word accuracy):

Dataset	Accuracy (%)
IIIT5K-3000	93.4
SVT-647	95.2
ICDAR2013-857	98.4
ICDAR2013-1015	97.4
ICDAR2015-1811	86.9
ICDAR2015-2077	81.2
SVTP-645	92.1
CT80-288	90.6

gigarouter hosts this pre-trained checkpoint as a managed API, providing OpenAI-compatible endpoints for image-to-text inference.

best for

·Printed text recognition on document scans
·Handwritten text transcription
·Scene text recognition from natural images

FAQ

What is this model best for?

It is best for optical character recognition on single text-line images, including printed, handwritten, and scene text.

How many parameters does it have?

334 million parameters.

What input format does it require?

Single text-line images; the model expects images of size 384x384 pixels, divided into 16x16 patches.

How can I use this model via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with an API key, passing an image as input.

Is this model fine-tuned?

No, this is the pre-trained base model (stage 1). Fine-tuned versions for specific tasks are available on the Hugging Face hub.

not yet live

We're benchmarking and onboarding TrOCR Base Stage 1 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related image-to-text models

compare all →

blip-image-captioning-base

1.9M dl/mo

blip-image-captioning-large

trocr-small-handwritten

448.6K dl/mo