MOSS-TTS

OpenMOSS-Team/MOSS-TTS

published Feb 2026 · updated Mar 2026

MOSS-TTS is a TTS model that performs zero-shot voice cloning, ultra-long speech generation (up to 1 hour), token-level duration control, phoneme-level pronunciation control, and multilingual/code-switched synthesis.

est. price

~$0.0075 (estimated, set at launch)

API providers

downloads / mo

911.8K

license

apache-2.0

specs

Task	Text-to-Speech (TTS)
Architecture	Autoregressive discrete-token (MossTTSDelay) with MOSS-Audio-Tokenizer
Parameters	8B
Supported Languages	20 languages including Chinese, English, German, Spanish, French, Japanese, Korean, Arabic, and more

about this model

MOSS-TTS is a speech and sound generation model family from MOSI.AI and the OpenMOSS team, designed for high-fidelity, expressive text-to-speech in complex real-world applications. The family comprises five production-ready models:

MOSS-TTS: flagship TTS with zero-shot voice cloning, ultra-long speech up to one hour, token-level duration control, phoneme/Pinyin pronunciation control, and multilingual and code-switched synthesis
MOSS-TTSD: spoken dialogue generation, reported to outperform Doubao and Gemini 2.5 Pro in subjective evaluations
MOSS-VoiceGenerator: text-prompted voice design, reported to surpass other top-tier voice design models in arena ratings
MOSS-TTS-Realtime: multi-turn, context-aware streaming TTS with a time-to-first-byte (TTFB) of 180 ms, or 377 ms when the LLM's first-sentence latency is added to TTFB
MOSS-SoundEffect: sound effect generation across diverse categories with controllable duration

Key Capabilities

Zero-shot voice cloning from short reference audio
Ultra-long speech generation up to 1 hour in a single run
Token-level duration control for precise pacing
Phoneme-level pronunciation control via Pinyin, IPA, or mixed input
Multilingual synthesis across 20 languages (see table) with smooth code-switching
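Token-level duration control implies converting a target length in seconds into a token budget. The sketch below is a minimal illustration, assuming one duration step per tokenizer frame at the 12.5 fps rate quoted in the FAQ. The function name `target_frames` is hypothetical and is not part of any MOSS-TTS API.

```python
import math

# Tokenizer frame rate quoted in the FAQ (frames per second).
FRAMES_PER_SECOND = 12.5


def target_frames(seconds: float) -> int:
    """Return the number of tokenizer frames needed to cover `seconds` of audio.

    Assumption: one duration step corresponds to one tokenizer frame.
    The result is rounded up so the budget never undershoots the target.
    """
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return math.ceil(seconds * FRAMES_PER_SECOND)


print(target_frames(4.0))   # 50
print(target_frames(3600))  # 45000 (the one-hour upper bound)
```

Under this assumption, the one-hour ceiling corresponds to a sequence of 45,000 frames in a single run.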

Benchmark Performance

MOSS-TTSD v1.0 is reported to lead on objective metrics and to outperform Doubao and Gemini 2.5 Pro in subjective evaluations. MOSS-VoiceGenerator is reported to surpass other top-tier voice design models in arena ratings. MOSS-TTS-Realtime delivers a time-to-first-byte (TTFB) of 180 ms, or 377 ms when the text LLM's first-sentence latency is included.
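The two latency figures can be combined into a rough budget. The sketch below assumes the 377 ms figure is a simple sum of the LLM's first-sentence latency and the 180 ms TTFB. That additive reading is an assumption, and the function name is hypothetical.

```python
TTFB_MS = 180      # quoted time-to-first-byte of MOSS-TTS-Realtime
COMBINED_MS = 377  # quoted LLM first sentence + TTFB


def implied_llm_first_sentence_ms(combined_ms: int = COMBINED_MS,
                                  ttfb_ms: int = TTFB_MS) -> int:
    """Latency attributable to the text LLM, assuming the two stages add up."""
    return combined_ms - ttfb_ms


print(implied_llm_first_sentence_ms())  # 197
```

A slower or faster text model would shift the combined figure by the same amount under this reading.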

Architecture

Model	Architecture	Size
MOSS-TTS	MossTTSDelay	8B
MOSS-TTS-Local-Transformer	MossTTSLocal	1.7B
MOSS-TTSD-v1.0	MossTTSDelay	8B
MOSS-VoiceGenerator	MossTTSDelay	1.7B
MOSS-SoundEffect	MossTTSDelay	8B
MOSS-TTS-Realtime	MossTTSRealtime	1.7B
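For capacity planning, the parameter counts in the table translate into an approximate weight footprint. The sketch below is a back-of-the-envelope estimate only: it assumes dense weights stored at a uniform precision and ignores activations, caches, and the audio tokenizer. The dtype choices are assumptions and are not stated in the source.

```python
# Bytes used per parameter at common storage precisions.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_gib(params_billion: float, dtype: str = "bfloat16") -> float:
    """Approximate size of the model weights in GiB (weights only)."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 2**30


for name, size_b in [("8B models", 8), ("1.7B models", 1.7)]:
    print(f"{name}: {weight_gib(size_b):.1f} GiB in bfloat16")
# 8B models: 14.9 GiB in bfloat16
# 1.7B models: 3.2 GiB in bfloat16
```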

Supported Languages

Language	Code	Language	Code	Language	Code
Chinese	zh	English	en	German	de
Spanish	es	French	fr	Japanese	ja
Italian	it	Hungarian	hu	Korean	ko
Russian	ru	Persian (Farsi)	fa	Arabic	ar
Polish	pl	Portuguese	pt	Czech	cs
Danish	da	Swedish	sv	Greek	el
Turkish	tr
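A client can validate a requested language code against the table before sending a request. The mapping below is transcribed from the table as shown here, which lists 19 codes. The helper name `is_supported` is hypothetical.

```python
# Language codes transcribed from the Supported Languages table above.
SUPPORTED_LANGUAGES = {
    "zh": "Chinese", "en": "English", "de": "German",
    "es": "Spanish", "fr": "French", "ja": "Japanese",
    "it": "Italian", "hu": "Hungarian", "ko": "Korean",
    "ru": "Russian", "fa": "Persian (Farsi)", "ar": "Arabic",
    "pl": "Polish", "pt": "Portuguese", "cs": "Czech",
    "da": "Danish", "sv": "Swedish", "el": "Greek",
    "tr": "Turkish",
}


def is_supported(code: str) -> bool:
    """Case-insensitive check of a language code against the table."""
    return code.strip().lower() in SUPPORTED_LANGUAGES


print(is_supported("FA"))  # True
print(is_supported("nl"))  # False
```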

best for

· Zero-shot voice cloning from short reference audio
· Ultra-long speech generation (up to one hour) for audiobooks and narration
· Multilingual and code-switched TTS in 20 languages
· Fine-grained control over pronunciation (Pinyin/IPA) and pacing

FAQ

What is MOSS-TTS best for?

High-fidelity zero-shot voice cloning, stable long-form speech, and multilingual TTS with detailed pronunciation control.

How many parameters does MOSS-TTS have?

8 billion parameters.

What languages does MOSS-TTS support?

20 languages including Chinese, English, German, Spanish, French, Japanese, Korean, Arabic, Russian, Italian, and more.

How can I use MOSS-TTS via API?

Once MOSS-TTS is live, use the gigarouter OpenAI-compatible endpoint with your API key. Refer to the API documentation for the request format.
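As a rough illustration of what an OpenAI-compatible text-to-speech call usually looks like, the sketch below builds (but does not send) a POST to an `/audio/speech` route in the style of the OpenAI API. Everything specific here is an assumption: the base URL is a placeholder, the model id is taken from the page heading, and the `voice` and `response_format` values may not match what the hosted endpoint accepts. The model is listed as not yet live, so treat this as a shape to adapt, not a working recipe.

```python
import json
import urllib.request

BASE_URL = "https://api.example.invalid/v1"  # placeholder, NOT the real endpoint
MODEL_ID = "OpenMOSS-Team/MOSS-TTS"          # assumed model id (from the page heading)


def build_speech_request(text: str, api_key: str, voice: str = "default",
                         response_format: str = "wav") -> urllib.request.Request:
    """Build an OpenAI-style speech request without sending it."""
    payload = {
        "model": MODEL_ID,
        "input": text,
        "voice": voice,                      # assumed parameter value
        "response_format": response_format,  # assumed parameter value
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_speech_request("Hello from MOSS-TTS.", api_key="YOUR_API_KEY")
print(req.get_method(), req.full_url)
print(json.loads(req.data))
```

Sending the request would be a matter of passing `req` to `urllib.request.urlopen` and writing the response bytes to a file, once the real base URL and accepted parameters are known.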

What is the architecture of MOSS-TTS?

It uses an autoregressive discrete-token architecture (MossTTSDelay) built on MOSS-Audio-Tokenizer, which compresses 24 kHz audio to 12.5 frames per second (fps).
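The two rates above fix how much audio each tokenizer frame covers. The sketch below works through that arithmetic. The helper `frames_for_samples` is hypothetical, and its round-up behaviour for a trailing partial frame is an assumption, not a documented property of MOSS-Audio-Tokenizer.

```python
import math

SAMPLE_RATE_HZ = 24_000  # input audio sample rate
FRAME_RATE_HZ = 12.5     # tokenizer output frame rate

# Each frame summarises this many raw samples and this much time.
SAMPLES_PER_FRAME = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)  # 1920
FRAME_DURATION_MS = 1000 / FRAME_RATE_HZ                 # 80.0


def frames_for_samples(num_samples: int) -> int:
    """Frames needed to cover `num_samples`, rounding a partial frame up (assumed)."""
    return math.ceil(num_samples / SAMPLES_PER_FRAME)


print(SAMPLES_PER_FRAME)          # 1920
print(FRAME_DURATION_MS)          # 80.0
print(frames_for_samples(24_000)) # 13 (one second is 12.5 frames, rounded up)
```

In other words, each frame stands in for 1,920 samples, or 80 ms of audio.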

not yet live

We're benchmarking and onboarding MOSS-TTS as a hosted, OpenAI-compatible API.

related text-to-speech models

XTTS-v2

9.3M dl/mo

Qwen3-TTS-12Hz-1.7B-CustomVoice

2M dl/mo

Qwen3-TTS-12Hz-0.6B-CustomVoice