MOSS-TTS v1.5
OpenMOSS-Team/MOSS-TTS-v1.5
published May 2026 · updated May 2026
MOSS-TTS v1.5 is a text-to-speech model that performs zero-shot voice cloning, multilingual synthesis, code-switching, and controlled speech generation with explicit pause markers.
specs
| Field | Value |
|---|---|
| Task | Text-to-Speech (TTS) |
| Architecture | Autoregressive Transformer with discrete audio tokens (MOSS-Audio-Tokenizer) |
| Supported Languages | 31 languages |
about this model
MOSS-TTS-v1.5 is an autoregressive speech generation model that produces high-fidelity, expressive audio with zero-shot voice cloning, token-level duration control, phoneme- and pinyin-level pronunciation control, multilingual synthesis, code-switching, and stable long-form generation.
Key improvements over MOSS-TTS 1.0
- Stronger multilingual synthesis with language tags – specifying the language improves quality on all supported languages compared to 1.0.
- More stable voice cloning – higher speaker similarity and reduced variance across repeated generations.
- Better long-reference, short-text cloning – handles mismatched reference-to-target length more reliably.
- More stable punctuation-following prosody – follows punctuation-driven pauses more closely, especially in long sentences.
- Explicit pause control – supports inline markers such as `[pause 3.2s]` for precise timing.
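The pause markers in the last item can be generated and checked programmatically before text is sent to the model. The sketch below is only illustrative: the marker shape `[pause 3.2s]` comes from the example above, while the helper names `pause` and `total_pause_seconds`, the one-decimal formatting, and the strict pattern are assumptions, since the accepted marker grammar is not spelled out here.

```python
import re

# Marker shape copied from the example "[pause 3.2s]".
# Assumption: seconds as a decimal number followed by "s"; other spellings
# (milliseconds, missing space, etc.) are not documented on this page.
PAUSE_RE = re.compile(r"\[pause (\d+(?:\.\d+)?)s\]")


def pause(seconds: float) -> str:
    """Format a pause marker with one decimal place (hypothetical helper)."""
    if seconds <= 0:
        raise ValueError("pause length must be positive")
    return f"[pause {seconds:.1f}s]"


def total_pause_seconds(text: str) -> float:
    """Sum the durations of every pause marker found in the text."""
    return sum(float(m.group(1)) for m in PAUSE_RE.finditer(text))


script = (
    f"Welcome back. {pause(1.5)} "
    f"Today we cover three topics. {pause(0.5)} "
    "Let's begin."
)
```

Summing the markers gives a quick estimate of how much silence a script adds on top of the spoken audio, which is useful when a clip has to fit a fixed time slot.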
Supported languages (31 total)
| Language | Code | Language | Code | Language | Code |
|---|---|---|---|---|---|
| Chinese | zh | Cantonese | yue | English | en |
| Arabic | ar | Czech | cs | Danish | da |
| Dutch | nl | Finnish | fi | French | fr |
| German | de | Greek | el | Hebrew | he |
| Hindi | hi | Hungarian | hu | Italian | it |
| Japanese | ja | Korean | ko | Macedonian | mk |
| Malay | ms | Persian (Farsi) | fa | Polish | pl |
| Portuguese | pt | Romanian | ro | Russian | ru |
| Spanish | es | Swahili | sw | Swedish | sv |
| Tagalog | tl | Thai | th | Turkish | tr |
| Vietnamese | vi | | | | |
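Since language tags are said to improve quality, it can help to validate a code against the table before building a request. The following is a minimal sketch: the codes and names are copied from the table above, but `normalize_language` is a hypothetical helper, and how the tag is actually passed to the model is not specified on this page.

```python
# Language codes and names transcribed from the table above (31 entries).
SUPPORTED_LANGUAGES = {
    "zh": "Chinese", "yue": "Cantonese", "en": "English",
    "ar": "Arabic", "cs": "Czech", "da": "Danish",
    "nl": "Dutch", "fi": "Finnish", "fr": "French",
    "de": "German", "el": "Greek", "he": "Hebrew",
    "hi": "Hindi", "hu": "Hungarian", "it": "Italian",
    "ja": "Japanese", "ko": "Korean", "mk": "Macedonian",
    "ms": "Malay", "fa": "Persian (Farsi)", "pl": "Polish",
    "pt": "Portuguese", "ro": "Romanian", "ru": "Russian",
    "es": "Spanish", "sw": "Swahili", "sv": "Swedish",
    "tl": "Tagalog", "th": "Thai", "tr": "Turkish",
    "vi": "Vietnamese",
}


def normalize_language(code: str) -> str:
    """Return the lower-cased code if it is in the table, else raise ValueError."""
    key = code.strip().lower()
    if key not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {code!r}")
    return key
```

Failing early on an unknown code avoids silently sending a tag the model was never trained on.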
The model uses a causal Transformer tokenizer (MOSS-Audio-Tokenizer) that compresses 24 kHz audio to 12.5 fps with variable-bitrate RVQ. It is designed for scalable, real-world deployment and is hosted by gigarouter as an OpenAI-compatible API. For the full 1.0 feature walkthrough and evaluation details, refer to the MOSS-TTS 1.0 README and the technical report (arXiv:2603.18090).
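The tokenizer figures above imply some simple arithmetic worth making explicit. The sketch below uses only the two numbers stated in the text (24 kHz and 12.5 fps); the function name `frames_for` and the choice to round up partial frames are assumptions, and the number of discrete tokens per frame is left out because it depends on the variable-bitrate RVQ configuration, which is not given here.

```python
import math

SAMPLE_RATE_HZ = 24_000   # from the text: 24 kHz audio
FRAME_RATE_FPS = 12.5     # from the text: 12.5 frames per second

# Each frame summarizes this many raw audio samples.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_FPS

# Duration covered by one frame, in milliseconds.
frame_duration_ms = 1000 / FRAME_RATE_FPS


def frames_for(seconds: float) -> int:
    """Frames needed to cover a clip; rounding up is an assumption."""
    return math.ceil(seconds * FRAME_RATE_FPS)
```

Under these figures each frame covers 1920 samples (80 ms), so a 10-second clip spans 125 frames and a one-minute clip 750 frames.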
best for
- Zero-shot voice cloning from a short reference audio
- Long-form multilingual speech generation with code-switching
- Explicit pause control for precise timing in synthetic speech
FAQ
MOSS-TTS v1.5 supports 31 languages, including Chinese, English, French, German, Japanese, and Korean.
v1.5 adds stronger multilingual synthesis with language tags, more stable voice cloning, better long-reference cloning, and explicit pause control.
Use the gigarouter OpenAI-compatible endpoint with an API key; refer to the hosting service documentation for endpoint details.
The model can run on CPU but is optimized for CUDA GPUs with FlashAttention 2 support for faster inference.
Input is text, optionally accompanied by reference audio and a language tag; build it with the `processor.build_user_message()` method.
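For the hosted route mentioned in the FAQ, a request might look like the sketch below. It is built on assumptions: the base URL is a placeholder, the model id is taken from the identifier at the top of this page and may differ on the hosted service, and the `/audio/speech` path with `model`, `input`, and `voice` fields follows the OpenAI speech API shape, which this service may or may not mirror exactly. The request is only constructed, not sent.

```python
import json
import urllib.request

# Hypothetical values: replace with the ones from the hosting documentation.
BASE_URL = "https://api.example.invalid/v1"
MODEL_ID = "OpenMOSS-Team/MOSS-TTS-v1.5"


def build_speech_request(text: str, api_key: str, voice: str = "default",
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {"model": MODEL_ID, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_speech_request("Hello. [pause 1.0s] Goodbye.", api_key="sk-placeholder")

# Sending is left commented out because the endpoint above is a placeholder:
# with urllib.request.urlopen(req) as resp:
#     audio_bytes = resp.read()
```

Keeping request construction separate from sending makes the payload easy to inspect and test before a real endpoint and key are available.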
We're benchmarking and onboarding MOSS-TTS v1.5 as a hosted, OpenAI-compatible API.