TrOCR Base Stage 1
microsoft/trocr-base-stage1
published Mar 2022 · updated May 2024
TrOCR Base Stage 1 is an image-to-text model for optical character recognition using a Transformer encoder-decoder architecture.
specs
| Task | Image-to-Text (Optical Character Recognition) |
| Architecture | Vision Encoder-Decoder Transformer |
| Parameters | 334 million |
| Patch Size | 16x16 |
about this model
microsoft/trocr-base-stage1 is an image-to-text (optical character recognition) model that uses a Transformer encoder-decoder architecture for end-to-end text recognition. The encoder, initialized from BEiT, processes images as 16×16 patches; the decoder, initialized from RoBERTa, generates wordpiece-level tokens autoregressively. This pre-trained-only checkpoint (334M parameters) was introduced in the paper “TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models” (Li et al., AAAI 2023).
Architecture and Pre-training
The model treats OCR as a sequence-to-sequence task without requiring separate language model post-processing. It was pre-trained on large-scale synthetic data and is designed to be fine-tuned on specific printed, handwritten, or scene text recognition datasets. Pre-training uses an input resolution of 384×384, a learning rate of 2e-5, weight decay 0.0001, warmup updates 500, and a batch size of 8.
Key Benchmarks
When fine-tuned for downstream tasks, the TrOCR-Base model achieves the following results:
- IAM Handwriting: Cased Character Error Rate of 3.42
- SROIE Printed Receipts: F1 score of 96.34
- Scene Text Recognition (word accuracy):
| Dataset | Accuracy (%) |
|---|---|
| IIIT5K-3000 | 93.4 |
| SVT-647 | 95.2 |
| ICDAR2013-857 | 98.4 |
| ICDAR2013-1015 | 97.4 |
| ICDAR2015-1811 | 86.9 |
| ICDAR2015-2077 | 81.2 |
| SVTP-645 | 92.1 |
| CT80-288 | 90.6 |
gigarouter hosts this pre-trained checkpoint as a managed API, providing OpenAI-compatible endpoints for image-to-text inference.
best for
- ·Printed text recognition on document scans
- ·Handwritten text transcription
- ·Scene text recognition from natural images
FAQ
It is best for optical character recognition on single text-line images, including printed, handwritten, and scene text.
334 million parameters.
Single text-line images; the model expects images of size 384x384 pixels, divided into 16x16 patches.
Use the gigarouter OpenAI-compatible endpoint with an API key, passing an image as input.
No, this is the pre-trained base model (stage 1). Fine-tuned versions for specific tasks are available on the Hugging Face hub.
We're benchmarking and onboarding TrOCR Base Stage 1 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.