skip to content
gigarouter gigarouter
models / image-to-text · coming soon

MGP-STR Base

alibaba-damo/mgp-str-base

published Nov 2022 · updated Dec 2023

MGP-STR Base is a image-to-text model that performs scene text recognition using a pure vision Transformer with multi-granularity prediction.

est. price
~$0.047
/ 1k images · estimated, set at launch
API providers
0
downloads / mo
110.8K

specs

TaskScene Text Recognition (Image-to-Text)
ArchitectureVision Transformer (ViT) with A^3 modules and multi-granularity prediction (character, subword, word)
Training DataMJSynth and SynthText
LicenseApache 2.0

about this model

MGP-STR (base-sized) is an image-to-text model for scene text recognition that uses a pure vision transformer (ViT) with multi-granularity prediction to decode text from images without a separate language model.

Architecture and approach

The model processes 32x128 pixel input images as a sequence of 4x4 patches. A ViT backbone, initialised from DeiT-base weights, extracts visual features. Specially designed A modules then select and combine informative token representations for each character position. In addition to character-level predictions, the model uses subword classification heads based on BPE and WordPiece tokenisation, implicitly modelling language information. The outputs from all three granularities (character, subword, word) are fused to produce the final text transcription.

Performance and training

Trained on the MJSynth and SynthText datasets, MGP-STR achieves an average recognition accuracy of 93.35% across standard scene-text benchmarks (IC13, IC15, IIIT5K, SVT, SVTP, CUTE80 and others). The model was introduced in a paper accepted at ECCV 2022 and is released under the Apache 2.0 license. Gigarouter hosts MGP-STR as a managed, OpenAI-compatible API, eliminating the need for local infrastructure.

The following image is a typical input from the IIIT-5K dataset that the model would transcribe:

best for

FAQ

What accuracy does MGP-STR Base achieve?

It achieves an average recognition accuracy of 93.35% on standard benchmarks.

What input size does the model expect?

Input images should be 32x128 pixels, presented as patches of 4x4.

What is the model's license?

The model is released under the Apache License 2.0.

How can I call this model via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key, sending the image as a base64-encoded string or URL.

Was this model published in a research paper?

Yes, it was introduced in the paper "Multi-Granularity Prediction for Scene Text Recognition" at ECCV 2022.

not yet live

We're benchmarking and onboarding MGP-STR Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related image-to-text models

compare all →