MOSS-TTS
OpenMOSS-Team/MOSS-TTS
published Feb 2026 · updated Mar 2026
MOSS-TTS is a TTS model that performs zero-shot voice cloning, ultra-long speech generation (up to 1 hour), token-level duration control, phoneme-level pronunciation control, and multilingual/code-switched synthesis.
specs
| Spec | Value |
|---|---|
| Task | Text-to-Speech (TTS) |
| Architecture | Autoregressive discrete-token (MossTTSDelay) with MOSS-Audio-Tokenizer |
| Parameters | 8B |
| Supported Languages | 20 languages including Chinese, English, German, Spanish, French, Japanese, Korean, Arabic, and more |
about this model
MOSS-TTS is a speech and sound generation model family from MOSI.AI and the OpenMOSS team, designed for high-fidelity, expressive text-to-speech in complex real-world applications. The family comprises five production-ready models:
- MOSS-TTS: flagship TTS with zero-shot voice cloning, ultra-long speech up to one hour, token-level duration control, phoneme/Pinyin pronunciation control, and multilingual and code-switched synthesis
- MOSS-TTSD: spoken dialogue generation that outperformed Doubao and Gemini 2.5-pro in subjective evaluations
- MOSS-VoiceGenerator: text-prompted voice design that surpasses other top-tier voice design models in arena ratings
- MOSS-TTS-Realtime: multi-turn, context-aware streaming TTS with a time-to-first-byte (TTFB) of 180 ms, or 377 ms when the LLM's first-sentence latency is added to the TTFB
- MOSS-SoundEffect: sound effect generation across diverse categories with controllable duration
Key Capabilities
- Zero-shot voice cloning from short reference audio
- Ultra-long speech generation up to 1 hour in a single run
- Token-level duration control for precise pacing
- Phoneme-level pronunciation control via Pinyin, IPA, or mixed input
- Multilingual synthesis across 20 languages (see table) with smooth code-switching
Benchmark Performance
MOSS-TTSD v1.0 achieves industry-leading objective metrics and outperformed Doubao and Gemini 2.5-pro in subjective evaluations. MOSS-VoiceGenerator surpasses other top-tier voice design models in arena ratings. MOSS-TTS-Realtime delivers a 180 ms time-to-first-byte (TTFB), or 377 ms of combined latency when the first-sentence generation time of the upstream text LLM is included.
Architecture
| Model | Architecture | Size |
|---|---|---|
| MOSS-TTS | MossTTSDelay | 8B |
| MOSS-TTS-Local-Transformer | MossTTSLocal | 1.7B |
| MOSS-TTSD-v1.0 | MossTTSDelay | 8B |
| MOSS-VoiceGenerator | MossTTSDelay | 1.7B |
| MOSS-SoundEffect | MossTTSDelay | 8B |
| MOSS-TTS-Realtime | MossTTSRealtime | 1.7B |
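As a rough guide to what the sizes in the table mean in practice, the sketch below estimates weight memory for each model. It assumes 16-bit weights (2 bytes per parameter) and treats the listed sizes as exact parameter counts; neither assumption comes from the model card, and activations, caches, and the audio tokenizer are not counted.

```python
# Rough weight-memory estimate for the model sizes listed in the table.
# Assumption (not from the model card): weights are stored as 16-bit
# floats, i.e. 2 bytes per parameter. Runtime memory will be higher.
BYTES_PER_PARAM = 2

# Nominal parameter counts taken from the "Size" column.
MODEL_PARAMS = {
    "MOSS-TTS": 8e9,
    "MOSS-TTS-Local-Transformer": 1.7e9,
    "MOSS-TTSD-v1.0": 8e9,
    "MOSS-VoiceGenerator": 1.7e9,
    "MOSS-SoundEffect": 8e9,
    "MOSS-TTS-Realtime": 1.7e9,
}


def weight_gigabytes(num_params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Return the approximate weight size in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, params in MODEL_PARAMS.items():
    print(f"{name}: ~{weight_gigabytes(params):.1f} GB of weights")
```

Under these assumptions the 8B models need about 16 GB for weights alone and the 1.7B models about 3.4 GB; quantized or 32-bit deployments would scale these figures accordingly.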
Supported Languages
| Language | Code | Language | Code | Language | Code |
|---|---|---|---|---|---|
| Chinese | zh | English | en | German | de |
| Spanish | es | French | fr | Japanese | ja |
| Italian | it | Hungarian | hu | Korean | ko |
| Russian | ru | Persian (Farsi) | fa | Arabic | ar |
| Polish | pl | Portuguese | pt | Czech | cs |
| Danish | da | Swedish | sv | Greek | el |
| Turkish | tr | | | | |
best for
- Zero-shot voice cloning from short reference audio
- Ultra-long speech generation (up to one hour) for audiobooks and narration
- Multilingual and code-switched TTS in 20 languages
- Fine-grained control over pronunciation (Pinyin/IPA) and pacing
FAQ
MOSS-TTS is best suited to high-fidelity zero-shot voice cloning, stable long-form speech, and multilingual TTS with detailed pronunciation control.
MOSS-TTS has 8 billion parameters.
MOSS-TTS supports 20 languages, including Chinese, English, German, Spanish, French, Japanese, Korean, Arabic, Russian, Italian, and more.
To use MOSS-TTS, call the gigarouter OpenAI-compatible endpoint with your API key. Refer to the API documentation for the request format.
MOSS-TTS uses an autoregressive discrete-token architecture (MossTTSDelay) built on MOSS-Audio-Tokenizer, which compresses 24 kHz audio to 12.5 frames per second (fps).
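The tokenizer figures above (24 kHz audio, 12.5 frames per second) imply some simple arithmetic that is useful when reasoning about long-form generation and duration control. The sketch below works it through. It assumes that a target duration maps linearly onto tokenizer frames at 12.5 fps; how many discrete tokens each frame contains is not stated here, so the helper names and the "frame" unit are illustrative only.

```python
SAMPLE_RATE_HZ = 24_000  # from the text: 24 kHz audio
FRAME_RATE_FPS = 12.5    # from the text: 12.5 frames per second


def samples_per_frame() -> float:
    """Number of raw audio samples summarized by one tokenizer frame."""
    return SAMPLE_RATE_HZ / FRAME_RATE_FPS


def frames_for_duration(seconds: float) -> int:
    """Frames needed for a target duration (assumes a linear mapping)."""
    return round(seconds * FRAME_RATE_FPS)


def duration_for_frames(frames: int) -> float:
    """Audio duration in seconds represented by a number of frames."""
    return frames / FRAME_RATE_FPS


print(samples_per_frame())        # 1920.0 samples per frame
print(frames_for_duration(3600))  # 45000 frames for one hour of speech
print(duration_for_frames(125))   # 10.0 seconds
```

One hour of audio therefore corresponds to 45,000 tokenizer frames, which gives a sense of the sequence lengths involved in the "up to 1 hour" claim.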
We're benchmarking and onboarding MOSS-TTS as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
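Once the hosted endpoint is available, a request might look roughly like the sketch below. It follows the general shape of the OpenAI speech API (a POST to `/audio/speech` with `model`, `input`, and `voice` fields), but the base URL, the model identifier, and the `voice` value are placeholders assumed for illustration, not values confirmed for this service. The code only builds the request; it does not send it.

```python
import json
import urllib.request

# Hypothetical placeholders: replace with values from the API documentation.
BASE_URL = "https://api.example.com/v1"  # assumed, not the real endpoint
MODEL_ID = "OpenMOSS-Team/MOSS-TTS"      # assumed model identifier


def build_speech_request(text: str, api_key: str, voice: str = "default") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {"model": MODEL_ID, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_speech_request("Hello from MOSS-TTS.", api_key="YOUR_API_KEY")
print(req.get_method(), req.full_url)
# To send it against a real endpoint, pass `req` to urllib.request.urlopen
# and write the response bytes to an audio file.
```

Keeping request construction separate from sending makes it easy to inspect the payload before any credit is spent. Features such as voice cloning or pronunciation control would need additional fields whose names depend on the actual API documentation.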