Sarashina 2.2 TTS
sbintuitions/sarashina2.2-tts
published Apr 2026 · updated Jul 2026
Sarashina 2.2 TTS is a Japanese-centric text-to-speech system built on a large language model, supporting Japanese and English with zero-shot voice generation.
specs
| Spec | Value |
|---|---|
| Task | Text-to-Speech (TTS) |
| Architecture | LLM-based TTS |
| Training Data Scale | ~361k hours of speech |
| Languages | Japanese and English |
about this model
Sarashina2.2-TTS is a Japanese-centric text-to-speech model built on a large language model, developed by SB Intuitions. It synthesizes natural, expressive speech in Japanese and English, supporting zero-shot voice generation from a short reference clip across diverse speaking styles including narration, broadcast, conversation, and customer service.
Training and Performance
Sarashina2.2-TTS was trained on approximately 361,000 hours of speech with a balanced mix of Japanese and English data. The model achieves state-of-the-art kanji-level reading accuracy on the newly introduced Joyo Kanji Yomi Benchmark (covering all 2,136 Joyo kanji and their 4,378 readings) and delivers the highest speaker similarity in zero-shot Japanese speech synthesis. It matches top baselines on general sentence-level pronunciation. A dedicated metric, Kana-CER, compares synthesized speech against reference readings in the kana space to directly measure pronunciation correctness.
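To make the idea behind Kana-CER concrete, here is a minimal sketch of a character error rate computed over kana strings. It assumes the conventional definition (Levenshtein distance divided by reference length), assumes the synthesized audio has already been transcribed to kana by some upstream step, and folds katakana into hiragana before comparing. The function names and the normalization are illustrative assumptions; the exact procedure used for the published metric may differ.

```python
def katakana_to_hiragana(text: str) -> str:
    """Fold katakana (U+30A1..U+30F6) into hiragana by shifting code points by 0x60."""
    return "".join(
        chr(ord(ch) - 0x60) if 0x30A1 <= ord(ch) <= 0x30F6 else ch
        for ch in text
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def kana_cer(reference_kana: str, hypothesis_kana: str) -> float:
    """Character error rate in kana space (assumed definition: distance / len(reference))."""
    ref = katakana_to_hiragana(reference_kana)
    hyp = katakana_to_hiragana(hypothesis_kana)
    if not ref:
        raise ValueError("reference reading must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# 金沢 should be read かなざわ; a misreading as きんざわ differs in two of four kana.
print(kana_cer("かなざわ", "きんざわ"))  # 0.5
print(kana_cer("カナザワ", "かなざわ"))  # 0.0 (script difference only)
```

Scoring in kana rather than in kanji means a misread character is penalized even when the written form would look identical, which is what the metric is meant to capture.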
Cross-Lingual and Code-Switching Robustness
Sarashina2.2-TTS is the only system that maintains stable Japanese pronunciation regardless of the prompt language, demonstrating strong cross-lingual robustness. It handles code-switching within a single utterance naturally, preserving speaker identity across languages.
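As an illustration of what a code-switched input looks like, the sketch below splits an utterance into runs of Japanese and Latin script. It is only a way to inspect input text and is not a description of how the model processes it; the function names and the choice of Unicode ranges (hiragana, katakana, and the main CJK ideograph block) are assumptions made for this example.

```python
def script_of(ch: str):
    """Classify a character as Japanese ('ja'), Latin ('en'), or neutral (None)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x30FF or 0x4E00 <= code <= 0x9FFF:
        return "ja"  # hiragana, katakana, or a common CJK ideograph
    if ch.isascii() and ch.isalpha():
        return "en"
    return None  # digits, punctuation, and spaces stay with the current run


def script_runs(text: str):
    """Split text into (script, substring) runs, breaking where the script changes."""
    runs = []
    current = None
    buffer = ""
    for ch in text:
        script = script_of(ch)
        if script is not None and current is not None and script != current:
            runs.append((current, buffer))
            buffer = ""
        if script is not None:
            current = script
        buffer += ch
    if buffer:
        runs.append((current or "neutral", buffer))
    return runs


for script, chunk in script_runs("お客様のSoftBank光のご契約"):
    print(script, chunk)
```

An utterance that yields more than one distinct script in its runs is a candidate for checking speaker consistency across the language boundary.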
Audio Samples
The samples below demonstrate zero-shot speaker adaptation, diverse speaking styles, and English generation capabilities. Additional samples for style variety, zero-shot voice cloning, cross-lingual generation, and code-switching are available on the model card.
| Sample | Text | Audio |
|---|---|---|
| Zero-shot Speaker Adaptation | 東京から金沢までは新幹線を利用するのが便利で、所要時間は約2時間半です。 | Reference, Generated |
| Diverse Speaking Styles | お待たせいたしました。お客様のSoftBank光のご契約状況が確認できました。あわせて、Y!mobileとのおうち割 光セットの適用状況をお調べしたいのですが、現在お使いの携帯電話番号をお伺いしてもよろしいでしょうか? | Reference, Generated |
| English Generation | There is something remarkable about the way language shapes the way we think. A single phrase, spoken in the right tone, can carry emotions that words alone cannot express. | Reference, Generated |
Data and Open-Source Foundation
Sarashina2.2-TTS was trained exclusively on legitimately acquired and properly licensed speech data. It builds upon open-source projects including CosyVoice, HiFT-GAN, and 3D-Speaker.
best for
- Japanese text-to-speech with high pronunciation accuracy
- Zero-shot voice cloning for content creation
- Bilingual (Japanese-English) speech synthesis
- Customer service and narration voice generation
FAQ
Sarashina 2.2 TTS is a Japanese-centric text-to-speech system built on a large language model, developed by SB Intuitions, supporting Japanese and English with zero-shot voice generation.
The model supports both Japanese and English text-to-speech synthesis, including code-switching within a single utterance.
Zero-shot voice cloning reproduces a speaker's voice, speaking style, and acoustic characteristics from a short reference audio clip, with no fine-tuning required.
To request speech synthesis through the API, use the gigarouter OpenAI-compatible endpoint with your API key.
For local deployment, the Docker image runs on GPUs with about 6 GB of VRAM using the Hugging Face Transformers backend; the vLLM backend gives faster inference but requires more VRAM.
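The FAQ mentions an OpenAI-compatible endpoint, so a request would presumably follow the shape of the OpenAI speech API (`POST /audio/speech` with `model`, `input`, and `voice` fields). The sketch below builds such a request using only the standard library. The base URL is a placeholder, the `voice` value is hypothetical, and using the catalog identifier as the model name is an assumption; how a reference clip for zero-shot cloning would be passed is not stated on this page and is not shown.

```python
import json
import urllib.request

# Hypothetical placeholder: replace with the real endpoint once it is published.
BASE_URL = "https://api.example.com/v1"
# Assumption: the hosted API accepts the catalog identifier as the model name.
MODEL_ID = "sbintuitions/sarashina2.2-tts"


def build_speech_request(text: str, api_key: str,
                         voice: str = "default",
                         response_format: str = "mp3") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style speech synthesis request."""
    payload = {
        "model": MODEL_ID,
        "input": text,
        "voice": voice,  # hypothetical voice name
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text: str, api_key: str, out_path: str) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, api_key)
    with urllib.request.urlopen(request) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
```

Calling `synthesize("こんにちは", api_key, "hello.mp3")` would only work against a live endpoint; `build_speech_request` can be inspected offline to check the payload and headers.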
We're benchmarking and onboarding Sarashina 2.2 TTS as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.