TrOCR Large Handwritten

microsoft/trocr-large-handwritten

published Mar 2022 · updated May 2024

TrOCR Large Handwritten is an image-to-text model that performs optical character recognition (OCR) on handwritten text-line images using a transformer-based encoder-decoder architecture.

status

coming soon

API providers

downloads / mo

182.4K

specs

Task	Image-to-Text (Optical Character Recognition)
Architecture	Encoder-decoder Transformer (image encoder initialized from BEiT, text decoder from RoBERTa)
Parameters	558 million
License	MIT
Fine-tuned Dataset	IAM Handwriting Database

about this model

microsoft/trocr-large-handwritten is a transformer-based optical character recognition (OCR) model that converts single text-line images into text, using an encoder-decoder architecture with a BEiT image encoder and a RoBERTa text decoder.

Architecture and Capabilities

The model processes images as sequences of 16x16 patches and generates wordpiece-level text autoregressively. With 558 million parameters, it is designed for end-to-end text recognition without a separate language model. It was introduced in the paper "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models" (AAAI 2023) and is released under the MIT license.

Benchmark Performance

The model achieves a Cased Character Error Rate (CER) of 2.89% on the IAM handwritten test set, outperforming TrOCR-Small (4.22 CER) and TrOCR-Base (3.42 CER). On the SROIE printed text benchmark, it attains an F1 score of 96.60%. For scene text recognition, word accuracies on standard benchmarks are:

Benchmark	Accuracy
IIIT5K-3000	94.1%
SVT-647	96.1%
ICDAR2013-857	98.4%
ICDAR2013-1015	97.3%
ICDAR2015-1811	88.1%
ICDAR2015-2077	84.1%
SVTP-645	93.0%
CT80-288	95.1%

These results demonstrate the model’s effectiveness across handwritten, printed, and scene text recognition tasks.

best for

·Handwritten text recognition from single-line images
·Digitization of historical manuscripts
·Automated form processing with handwritten fields

FAQ

What is this model best for?

It is best for optical character recognition of handwritten text from single-line images, with state-of-the-art accuracy on the IAM benchmark.

How accurate is it on the IAM handwritten test set?

It achieves a Cased Character Error Rate (CER) of 2.89% on the IAM dataset.

What are the input and output formats?

Input: a single text-line image (e.g. JPEG/PNG). Output: a plain text string of recognized characters.

What license does this model use?

It is released under the MIT License.

How do I call this model via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key, specifying the model name "TrOCR Large Handwritten" and providing the image in base64 or as a URL.

not yet live

We're benchmarking and onboarding TrOCR Large Handwritten as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related image-to-text models

compare all →

blip-image-captioning-base

1.9M dl/mo

blip-image-captioning-large

trocr-small-handwritten

448.6K dl/mo