MGP-STR Base
alibaba-damo/mgp-str-base
published Nov 2022 · updated Dec 2023
MGP-STR Base is a image-to-text model that performs scene text recognition using a pure vision Transformer with multi-granularity prediction.
specs
| Task | Scene Text Recognition (Image-to-Text) |
| Architecture | Vision Transformer (ViT) with A^3 modules and multi-granularity prediction (character, subword, word) |
| Training Data | MJSynth and SynthText |
| License | Apache 2.0 |
about this model
MGP-STR (base-sized) is an image-to-text model for scene text recognition that uses a pure vision transformer (ViT) with multi-granularity prediction to decode text from images without a separate language model.
Architecture and approach
The model processes 32x128 pixel input images as a sequence of 4x4 patches. A ViT backbone, initialised from DeiT-base weights, extracts visual features. Specially designed A modules then select and combine informative token representations for each character position. In addition to character-level predictions, the model uses subword classification heads based on BPE and WordPiece tokenisation, implicitly modelling language information. The outputs from all three granularities (character, subword, word) are fused to produce the final text transcription.
Performance and training
Trained on the MJSynth and SynthText datasets, MGP-STR achieves an average recognition accuracy of 93.35% across standard scene-text benchmarks (IC13, IC15, IIIT5K, SVT, SVTP, CUTE80 and others). The model was introduced in a paper accepted at ECCV 2022 and is released under the Apache 2.0 license. Gigarouter hosts MGP-STR as a managed, OpenAI-compatible API, eliminating the need for local infrastructure.
The following image is a typical input from the IIIT-5K dataset that the model would transcribe:
best for
- ·Recognizing text in natural scene images
- ·Extracting text from photos for document digitization
- ·OCR on challenging, unconstrained backgrounds
FAQ
It achieves an average recognition accuracy of 93.35% on standard benchmarks.
Input images should be 32x128 pixels, presented as patches of 4x4.
The model is released under the Apache License 2.0.
Use the gigarouter OpenAI-compatible endpoint with your API key, sending the image as a base64-encoded string or URL.
Yes, it was introduced in the paper "Multi-Granularity Prediction for Scene Text Recognition" at ECCV 2022.
We're benchmarking and onboarding MGP-STR Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.