Sarashina 2.2 TTS

sbintuitions/sarashina2.2-tts

published Apr 2026 · updated Jul 2026

Sarashina 2.2 TTS is a Japanese-centric text-to-speech system built on a large language model, supporting Japanese and English with zero-shot voice generation.

est. price	~$0.0075 (estimated, set at launch)

API providers

downloads / mo	59.6K

specs

Task	Text-to-Speech (TTS)
Architecture	LLM-based TTS
Training Data Scale	~361k hours of speech
Languages	Japanese and English

about this model

Sarashina2.2-TTS is a Japanese-centric text-to-speech model built on a large language model, developed by SB Intuitions. It synthesizes natural, expressive speech in Japanese and English, supporting zero-shot voice generation from a short reference clip across diverse speaking styles including narration, broadcast, conversation, and customer service.

Training and Performance

Sarashina2.2-TTS was trained on approximately 361,000 hours of speech with a balanced mix of Japanese and English data. The model achieves state-of-the-art kanji-level reading accuracy on the newly introduced Joyo Kanji Yomi Benchmark (covering all 2,136 Joyo kanji and their 4,378 readings) and delivers the highest speaker similarity in zero-shot Japanese speech synthesis. It matches top baselines on general sentence-level pronunciation. A dedicated metric, Kana-CER, compares synthesized speech against reference readings in the kana space to directly measure pronunciation correctness.
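The Kana-CER idea can be illustrated as a character error rate computed over kana strings. The sketch below is not the benchmark's official implementation: it assumes the synthesized audio has already been transcribed into kana by some external step that is not shown, and the function names `to_hiragana`, `edit_distance`, and `kana_cer` are hypothetical. Katakana is folded into hiragana so that script choice does not count as an error.

```python
import unicodedata


def to_hiragana(text: str) -> str:
    """Apply NFKC normalization, then fold katakana into hiragana."""
    text = unicodedata.normalize("NFKC", text)
    out = []
    for ch in text:
        code = ord(ch)
        # Katakana U+30A1..U+30F6 sit exactly 0x60 above their hiragana forms.
        if 0x30A1 <= code <= 0x30F6:
            out.append(chr(code - 0x60))
        else:
            out.append(ch)
    return "".join(out)


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def kana_cer(reference_kana: str, hypothesis_kana: str) -> float:
    """Edit distance in kana space divided by the reference length."""
    ref = to_hiragana(reference_kana)
    hyp = to_hiragana(hypothesis_kana)
    if not ref:
        raise ValueError("reference reading must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Assumed example: the reference reading of 金沢 is かなざわ; a transcript of
# かなさわ differs by one character out of four.
print(kana_cer("かなざわ", "かなさわ"))  # 0.25
```

Comparing in kana rather than in mixed kanji-kana text means a misread kanji shows up as differing characters even when the written form would look identical.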

Cross-Lingual and Code-Switching Robustness

Sarashina2.2-TTS is reported to be the only system that keeps Japanese pronunciation stable regardless of the prompt language, which demonstrates strong cross-lingual robustness. It also handles code-switching within a single utterance naturally and preserves speaker identity across languages.

Audio Samples

The samples below demonstrate zero-shot speaker adaptation, diverse speaking styles, and English generation capabilities. Additional samples for style variety, zero-shot voice cloning, cross-lingual generation, and code-switching are available on the model card.

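Sample	Text (Reference / Generated audio)
Zero-shot Speaker Adaptation	東京から金沢までは新幹線を利用するのが便利で、所要時間は約２時間半です。 (English: "Taking the Shinkansen from Tokyo to Kanazawa is convenient, and the trip takes about two and a half hours.")
Diverse Speaking Styles	お待たせいたしました。お客様のSoftBank光のご契約状況が確認できました。あわせて、Y!mobileとのおうち割光セットの適用状況をお調べしたいのですが、現在お使いの携帯電話番号をお伺いしてもよろしいでしょうか？ (English: "Thank you for waiting. I have confirmed the status of your SoftBank 光 contract. I would also like to check whether the おうち割光セット discount with Y!mobile applies; may I ask for the mobile phone number you currently use?")
English Generation	There is something remarkable about the way language shapes the way we think. A single phrase, spoken in the right tone, can carry emotions that words alone cannot express.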

Data and Open-Source Foundation

Sarashina2.2-TTS was trained exclusively on legitimately acquired and properly licensed speech data. It builds upon open-source projects including CosyVoice, HiFT-GAN, and 3D-Speaker.

best for

·Japanese text-to-speech with high pronunciation accuracy
·Zero-shot voice cloning for content creation
·Bilingual (Japanese-English) speech synthesis
·Customer service and narration voice generation

FAQ

What is Sarashina 2.2 TTS?

It is a Japanese-centric text-to-speech system built on a large language model, developed by SB Intuitions, supporting Japanese and English with zero-shot voice generation.

What languages does it support?

It supports both Japanese and English text-to-speech synthesis, including code-switching within a single utterance.

How does zero-shot voice generation work?

It reproduces a speaker's voice, speaking style, and acoustic characteristics from a short reference audio clip, with no fine-tuning required.

How can I call this model via the gigarouter API?

Sarashina 2.2 TTS is not yet live on gigarouter. Once it is, send speech synthesis requests to the gigarouter OpenAI-compatible endpoint, authenticated with your API key.
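As a rough sketch of what such a request could look like, the snippet below builds (but does not send) a POST request in the shape of the OpenAI speech endpoint (`/audio/speech` with `model`, `input`, and `voice` fields). Everything specific here is an assumption: `BASE_URL` is a placeholder because this page does not give the real gigarouter URL, the `voice` value is illustrative, and how a reference clip would be supplied for zero-shot voice generation is not specified, so it is left out.

```python
import json
import urllib.request

# Placeholder only: the real gigarouter base URL is not stated on this page.
BASE_URL = "https://api.example.invalid/v1"
MODEL_ID = "sbintuitions/sarashina2.2-tts"


def build_speech_request(text: str, api_key: str, voice: str = "default",
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Build an OpenAI-style speech request without sending it."""
    # Field names follow the OpenAI speech API; the hosted service may differ.
    payload = {"model": MODEL_ID, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_speech_request("こんにちは", api_key="YOUR_API_KEY")
print(req.get_method(), req.full_url)
```

Once the hosted API exists, passing `req` to `urllib.request.urlopen` and writing the response bytes to a file would be the natural next step; check the provider's documentation for the actual URL and parameters first.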

What are the hardware requirements for running this model?

The Docker image runs on GPUs with about 6 GB of VRAM when using the HuggingFace Transformers backend. The vLLM backend gives faster inference but requires more VRAM.

not yet live

We're benchmarking and onboarding Sarashina 2.2 TTS as a hosted, OpenAI-compatible API.
