MGP-STR Base

alibaba-damo/mgp-str-base

published Nov 2022 · updated Dec 2023

MGP-STR Base is a image-to-text model that performs scene text recognition using a pure vision Transformer with multi-granularity prediction.

est. price

~$0.047

/ 1k images · estimated, set at launch

API providers

downloads / mo

110.8K

specs

Task	Scene Text Recognition (Image-to-Text)
Architecture	Vision Transformer (ViT) with A^3 modules and multi-granularity prediction (character, subword, word)
Training Data	MJSynth and SynthText
License	Apache 2.0

about this model

MGP-STR (base-sized) is an image-to-text model for scene text recognition that uses a pure vision transformer (ViT) with multi-granularity prediction to decode text from images without a separate language model.

Architecture and approach

The model processes 32x128 pixel input images as a sequence of 4x4 patches. A ViT backbone, initialised from DeiT-base weights, extracts visual features. Specially designed A modules then select and combine informative token representations for each character position. In addition to character-level predictions, the model uses subword classification heads based on BPE and WordPiece tokenisation, implicitly modelling language information. The outputs from all three granularities (character, subword, word) are fused to produce the final text transcription.

Performance and training

Trained on the MJSynth and SynthText datasets, MGP-STR achieves an average recognition accuracy of 93.35% across standard scene-text benchmarks (IC13, IC15, IIIT5K, SVT, SVTP, CUTE80 and others). The model was introduced in a paper accepted at ECCV 2022 and is released under the Apache 2.0 license. Gigarouter hosts MGP-STR as a managed, OpenAI-compatible API, eliminating the need for local infrastructure.

The following image is a typical input from the IIIT-5K dataset that the model would transcribe:

best for

·Recognizing text in natural scene images
·Extracting text from photos for document digitization
·OCR on challenging, unconstrained backgrounds

FAQ

What accuracy does MGP-STR Base achieve?

It achieves an average recognition accuracy of 93.35% on standard benchmarks.

What input size does the model expect?

Input images should be 32x128 pixels, presented as patches of 4x4.

What is the model's license?

The model is released under the Apache License 2.0.

How can I call this model via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key, sending the image as a base64-encoded string or URL.

Was this model published in a research paper?

Yes, it was introduced in the paper "Multi-Granularity Prediction for Scene Text Recognition" at ECCV 2022.

not yet live

We're benchmarking and onboarding MGP-STR Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related image-to-text models

compare all →

blip-image-captioning-base

1.9M dl/mo

blip-image-captioning-large

trocr-small-handwritten

448.6K dl/mo