Hosted image-to-text models
37 models · 0 live as APIs · benchmarked & compared
Image-to-text models convert visual information into text or structured output. They cover three broad tasks: optical character recognition (OCR), which extracts printed or handwritten text from scanned documents and photos; image captioning, which generates descriptive text for accessibility, content moderation, or metadata; and specialist tasks such as document layout analysis, orientation detection, or domain-specific inputs like manga panels. For example, microsoft/trocr-small-handwritten transcribes handwritten text, while PaddlePaddle/PP-OCRv5_server_det locates text regions in an image so that a recognition model such as PaddlePaddle/PP-OCRv5_server_rec can read them. Others, like Salesforce/blip-image-captioning-base, produce natural-language captions, and numind/NuExtract3 extracts structured data from document images.
In production, these models are typically chained into pipelines. A common pattern is document processing: first detect text regions, then recognize the characters in each region, and finally parse the output into actionable fields. Some systems add preprocessing before extraction, such as orientation correction (PaddlePaddle/PP-LCNet_x1_0_doc_ori) and unwarping of curved or skewed pages (PaddlePaddle/UVDoc). Choosing a model means trading off size, quality, and speed. Smaller models offer lower latency and lower compute cost but may lose accuracy on noisy or complex inputs. Larger models generally deliver higher-quality results at the expense of throughput. Domain-specific models, such as kha-white/manga-ocr-base, can outperform general-purpose OCR on their target data.
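The document-processing pattern described above can be sketched as a chain of interchangeable stages. This is a minimal illustration, not any vendor's API: `run_pipeline`, `Region`, and the toy stand-in functions are hypothetical names, and in a real system each stage would wrap a model call (for example an orientation classifier, a text detector, and a recognizer).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass
class Region:
    """One detected text region and, once recognized, its text."""
    box: Box
    text: str = ""


def run_pipeline(
    image: Any,
    detect: Callable[[Any], List[Box]],
    recognize: Callable[[Any, Box], str],
    parse: Callable[[List[Region]], Dict[str, str]],
    preprocess: Iterable[Callable[[Any], Any]] = (),
) -> Dict[str, str]:
    """Run preprocess -> detect -> recognize -> parse on a single image."""
    for step in preprocess:  # e.g. orientation correction, unwarping
        image = step(image)
    regions = [Region(box) for box in detect(image)]
    for region in regions:
        region.text = recognize(image, region.box)
    return parse(regions)


# Toy stand-ins (assumptions) so the sketch runs without any model.
# The "image" is just a mapping from box to the text it contains.
fake_image = {
    (0, 0, 100, 20): "Invoice No: 1042",
    (0, 30, 100, 50): "Total: 19.99",
}


def fake_detect(image: Any) -> List[Box]:
    # Return boxes in top-to-bottom reading order.
    return sorted(image.keys(), key=lambda box: box[1])


def fake_recognize(image: Any, box: Box) -> str:
    return image[box]


def parse_key_values(regions: List[Region]) -> Dict[str, str]:
    # Turn "Key: value" lines into fields; lines without a colon are skipped.
    fields: Dict[str, str] = {}
    for region in regions:
        key, sep, value = region.text.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


fields = run_pipeline(fake_image, fake_detect, fake_recognize, parse_key_values)
print(fields)  # {'Invoice No': '1042', 'Total': '19.99'}
```

Keeping each stage behind a plain callable makes it straightforward to swap a smaller model for a larger one, or a general-purpose recognizer for a domain-specific one, without touching the rest of the chain.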
For most call volumes, a hosted API removes the operational burden of provisioning GPUs, managing infrastructure, and handling scaling, while still providing pay-as-you-go flexibility and consistent performance.
compare
| model | params | downloads/mo | price | status |
|---|---|---|---|---|
| Salesforce/blip-image-captioning-base | - | 1.9M | at launch | coming soon |
| Salesforce/blip-image-captioning-large | 469.7M | 752.9K | ~$0.094 / 1k images | coming soon |
| PaddlePaddle/PP-OCRv5_server_det | - | 587.3K | at launch | coming soon |
| numind/NuExtract3 | 4539.3M | 520.7K | ~$1.341 / 1k images | coming soon |
| PaddlePaddle/UVDoc | - | 512.8K | at launch | coming soon |
| microsoft/trocr-small-handwritten | - | 448.6K | at launch | coming soon |
| PaddlePaddle/PP-LCNet_x1_0_doc_ori | - | 445.3K | at launch | coming soon |
| kha-white/manga-ocr-base | - | 389.4K | at launch | coming soon |
| ibm-granite/granite-vision-3.3-2b | 2975.4M | 343.3K | ~$0.626 / 1k images | coming soon |
| PaddlePaddle/PP-LCNet_x1_0_textline_ori | - | 274.6K | at launch | coming soon |
| microsoft/trocr-base-printed | 333.3M | 251.5K | ~$0.094 / 1k images | coming soon |
| lightonai/LightOnOCR-1B-1025 | 1161.2M | 199.9K | ~$0.235 / 1k images | coming soon |
| PaddlePaddle/PP-OCRv5_server_rec | - | 189.4K | at launch | coming soon |
| microsoft/trocr-large-handwritten | - | 182.4K | at launch | coming soon |
| microsoft/kosmos-2-patch14-224 | 1664.5M | 166.7K | ~$0.626 / 1k images | coming soon |
| naver-clova-ix/donut-base | - | 166K | at launch | coming soon |
| microsoft/trocr-base-stage1 | 384.3M | 149K | ~$0.094 / 1k images | coming soon |
| facebook/nougat-base | 348.7M | 145.4K | ~$0.094 / 1k images | coming soon |
| microsoft/trocr-large-printed | 608.1M | 133K | ~$0.235 / 1k images | coming soon |
| PaddlePaddle/PP-OCRv5_mobile_det | - | 129.4K | at launch | coming soon |
| microsoft/trocr-base-handwritten | 333.3M | 124K | ~$0.094 / 1k images | coming soon |
| alibaba-damo/mgp-str-base | 148M | 110.8K | ~$0.047 / 1k images | coming soon |
| PaddlePaddle/PP-OCRv6_medium_det | - | 89K | at launch | coming soon |
| PaddlePaddle/PP-OCRv6_medium_rec | - | 79.9K | at launch | coming soon |
| PaddlePaddle/PP-OCRv5_mobile_rec | - | 74.5K | at launch | coming soon |
| rtr46/meiki.txt.recognition.v0 | - | 65.6K | at launch | coming soon |
| nlpconnect/vit-gpt2-image-captioning | - | 64.4K | at launch | coming soon |
| PaddlePaddle/latin_PP-OCRv5_mobile_rec | - | 37.5K | at launch | coming soon |
| microsoft/trocr-small-printed | 61.4M | 36.3K | ~$0.047 / 1k images | coming soon |
| facebook/nougat-small | 247.4M | 28.5K | ~$0.094 / 1k images | coming soon |
| unsloth/GLM-OCR | - | 28K | at launch | coming soon |
| numind/NuMarkdown-8B-Thinking | 8292.2M | 26.1K | ~$1.341 / 1k images | coming soon |
| PaddlePaddle/en_PP-OCRv4_mobile_rec | - | 24.6K | at launch | coming soon |
| PaddlePaddle/PP-DocLayout_plus-L | - | 21.3K | at launch | coming soon |
| PaddlePaddle/PP-OCRv4_mobile_det | - | 20.1K | at launch | coming soon |
| PaddlePaddle/PP-DocBlockLayout | - | 18.6K | at launch | coming soon |
| tiiuae/Falcon-OCR | 269.9M | 5.1K | ~$0.094 / 1k images | coming soon |
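To make the listed prices easier to compare, here is a rough cost estimate for a given monthly volume. The per-1k prices are copied from the table above and are approximate (they are marked "~" and the models are not yet live); the volume of 250,000 images per month and the helper name `monthly_cost_usd` are assumptions chosen purely for illustration.

```python
# Approximate prices in USD per 1,000 images, taken from the table above.
PRICE_PER_1K_USD = {
    "alibaba-damo/mgp-str-base": 0.047,
    "microsoft/trocr-base-printed": 0.094,
    "lightonai/LightOnOCR-1B-1025": 0.235,
    "ibm-granite/granite-vision-3.3-2b": 0.626,
    "numind/NuExtract3": 1.341,
}


def monthly_cost_usd(model: str, images_per_month: int) -> float:
    """Estimated monthly cost: price per 1k images times thousands of images."""
    return PRICE_PER_1K_USD[model] * images_per_month / 1000


IMAGES_PER_MONTH = 250_000  # hypothetical workload

for model in PRICE_PER_1K_USD:
    cost = monthly_cost_usd(model, IMAGES_PER_MONTH)
    print(f"{model}: ${cost:,.2f} / month")
```

At this assumed volume the estimate ranges from about $11.75 for the cheapest listed model to about $335.25 for the most expensive, which puts a number on the size-versus-cost trade-off described earlier.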