skip to content
gigarouter gigarouter
models / image-to-text · coming soon

TrOCR Base Stage 1

microsoft/trocr-base-stage1

published Mar 2022 · updated May 2024

TrOCR Base Stage 1 is an image-to-text model for optical character recognition using a Transformer encoder-decoder architecture.

est. price
~$0.094
/ 1k images · estimated, set at launch
API providers
0
downloads / mo
149K

specs

TaskImage-to-Text (Optical Character Recognition)
ArchitectureVision Encoder-Decoder Transformer
Parameters334 million
Patch Size16x16

about this model

microsoft/trocr-base-stage1 is an image-to-text (optical character recognition) model that uses a Transformer encoder-decoder architecture for end-to-end text recognition. The encoder, initialized from BEiT, processes images as 16×16 patches; the decoder, initialized from RoBERTa, generates wordpiece-level tokens autoregressively. This pre-trained-only checkpoint (334M parameters) was introduced in the paper “TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models” (Li et al., AAAI 2023).

Architecture and Pre-training

The model treats OCR as a sequence-to-sequence task without requiring separate language model post-processing. It was pre-trained on large-scale synthetic data and is designed to be fine-tuned on specific printed, handwritten, or scene text recognition datasets. Pre-training uses an input resolution of 384×384, a learning rate of 2e-5, weight decay 0.0001, warmup updates 500, and a batch size of 8.

Key Benchmarks

When fine-tuned for downstream tasks, the TrOCR-Base model achieves the following results:

  • IAM Handwriting: Cased Character Error Rate of 3.42
  • SROIE Printed Receipts: F1 score of 96.34
  • Scene Text Recognition (word accuracy):
Dataset Accuracy (%)
IIIT5K-300093.4
SVT-64795.2
ICDAR2013-85798.4
ICDAR2013-101597.4
ICDAR2015-181186.9
ICDAR2015-207781.2
SVTP-64592.1
CT80-28890.6

gigarouter hosts this pre-trained checkpoint as a managed API, providing OpenAI-compatible endpoints for image-to-text inference.

best for

FAQ

What is this model best for?

It is best for optical character recognition on single text-line images, including printed, handwritten, and scene text.

How many parameters does it have?

334 million parameters.

What input format does it require?

Single text-line images; the model expects images of size 384x384 pixels, divided into 16x16 patches.

How can I use this model via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with an API key, passing an image as input.

Is this model fine-tuned?

No, this is the pre-trained base model (stage 1). Fine-tuned versions for specific tasks are available on the Hugging Face hub.

not yet live

We're benchmarking and onboarding TrOCR Base Stage 1 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related image-to-text models

compare all →