Manga OCR Base

kha-white/manga-ocr-base

published Mar 2022 · updated Jun 2022

Manga OCR Base is an image-to-text model that performs optical character recognition for Japanese text, with the main focus being Japanese manga.

status

coming soon

API providers

downloads / mo

389.4K

license

apache-2.0

specs

Task	Image-to-Text (OCR)
Architecture	Vision Encoder-Decoder (ViT + text decoder)
Input	Image (JPEG, PNG, etc.)
Output	Japanese text (supports multi-line)

about this model

Manga OCR is an optical character recognition model for Japanese text, specialized for printed text in manga and other image-heavy contexts. It uses a Vision Encoder Decoder architecture (Transformers framework) and is designed to handle the unique challenges of manga: vertical and horizontal text orientation, furigana, text overlaid on images, a wide variety of fonts and font styles, and low‑quality images. Unlike many OCR models, it supports recognizing multi‑line text in a single forward pass, allowing entire text bubbles to be processed without line splitting.

Key Capabilities

Robust recognition of both vertical and horizontal Japanese text.
Accurate handling of furigana (ruby annotations) and mixed‑script text.
Works directly on text overlaid on complex backgrounds, common in manga panels.
Performs well across diverse font families and degraded image quality.
End‑to‑end pipeline: accepts full images or cropped regions and outputs recognized text.

This model is hosted by Gigarouter as a managed, OpenAI‑compatible API. The underlying code and training details are available in the official repository.

best for

·OCR of Japanese manga text bubbles
·Reading vertical and horizontal Japanese text
·Extracting text from low-quality or overlaid images

FAQ

What is Manga OCR Base best for?

It is optimized for Japanese text recognition in manga, handling vertical/horizontal text, furigana, and poor image quality.

Does it support multi-line text in a single pass?

Yes, it can recognize multi-line text from a single forward pass, ideal for processing entire text bubbles at once.

What input formats are accepted?

It accepts image files (e.g., JPEG, PNG) or PIL Image objects via the Python API.

How can I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key, sending an image URL or base64-encoded image in the request.

What is the model architecture?

It uses the Vision Encoder-Decoder framework from Hugging Face Transformers, combining a vision encoder with a text decoder.

not yet live

We're benchmarking and onboarding Manga OCR Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related image-to-text models

compare all →

blip-image-captioning-base

1.9M dl/mo

blip-image-captioning-large

trocr-small-handwritten

448.6K dl/mo