
MOSS-TTS

OpenMOSS-Team/MOSS-TTS

published Feb 2026 · updated Mar 2026

MOSS-TTS is a TTS model that performs zero-shot voice cloning, ultra-long speech generation (up to 1 hour), token-level duration control, phoneme-level pronunciation control, and multilingual/code-switched synthesis.

est. price: ~$0.0075 (estimated, set at launch)
API providers: 0
downloads / mo: 911.8K
license: apache-2.0

specs

Task: Text-to-Speech (TTS)
Architecture: Autoregressive discrete-token (MossTTSDelay) with MOSS-Audio-Tokenizer
Parameters: 8B
Supported Languages: 20 languages including Chinese, English, German, Spanish, French, Japanese, Korean, Arabic, and more

about this model

MOSS-TTS is a speech and sound generation model family from MOSI.AI and the OpenMOSS team, designed for high-fidelity, expressive, and complex real-world text-to-speech applications. The family comprises five production-ready models:

  • MOSS-TTS: flagship TTS with zero-shot voice cloning, ultra-long speech up to one hour, token-level duration control, phoneme/pinyin pronunciation control, and multilingual and code-switched synthesis
  • MOSS-TTSD: spoken dialogue generation that outperformed Doubao and Gemini 2.5-pro in subjective evaluations
  • MOSS-VoiceGenerator: text-prompted voice design that surpasses other top-tier voice design models in arena ratings
  • MOSS-TTS-Realtime: multi-turn, context-aware streaming TTS with a time-to-first-byte (TTFB) of 180 ms and a combined LLM-first-sentence plus TTFB latency of 377 ms
  • MOSS-SoundEffect: sound effect generation across diverse categories with controllable duration


Key Capabilities

  • Zero-shot voice cloning from short reference audio
  • Ultra-long speech generation up to 1 hour in a single run
  • Token-level duration control for precise pacing
  • Phoneme-level pronunciation control via Pinyin, IPA, or mixed input
  • Multilingual synthesis across 20 languages (see table) with smooth code-switching

Benchmark Performance

MOSS-TTSD v1.0 achieves industry-leading objective metrics and outperformed Doubao and Gemini 2.5-pro in subjective evaluations. MOSS-VoiceGenerator surpasses other top-tier voice design models in arena ratings. MOSS-TTS-Realtime delivers a time-to-first-byte (TTFB) of 180 ms, and 377 ms of combined latency when the LLM's first-sentence generation time is added to the TTFB.

Architecture

Model                         Architecture       Size
MOSS-TTS                      MossTTSDelay       8B
MOSS-TTS-Local-Transformer    MossTTSLocal       1.7B
MOSS-TTSD-v1.0                MossTTSDelay       8B
MOSS-VoiceGenerator           MossTTSDelay       1.7B
MOSS-SoundEffect              MossTTSDelay       8B
MOSS-TTS-Realtime             MossTTSRealtime    1.7B

Supported Languages

Language    Code    Language           Code    Language    Code
Chinese     zh      English            en      German      de
Spanish     es      French             fr      Japanese    ja
Italian     it      Hungarian          hu      Korean      ko
Russian     ru      Persian (Farsi)    fa      Arabic      ar
Polish      pl      Portuguese         pt      Czech       cs
Danish      da      Swedish            sv      Greek       el
Turkish     tr
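
For client-side validation, the codes listed in the table above can be kept in a small lookup. This is only a sketch: the helper name `language_name` is hypothetical, and this page does not say whether the hosted API accepts a language code at all.

```python
# Language codes copied from the table above (only the entries it lists).
SUPPORTED_LANGUAGES = {
    "zh": "Chinese", "en": "English", "de": "German",
    "es": "Spanish", "fr": "French", "ja": "Japanese",
    "it": "Italian", "hu": "Hungarian", "ko": "Korean",
    "ru": "Russian", "fa": "Persian (Farsi)", "ar": "Arabic",
    "pl": "Polish", "pt": "Portuguese", "cs": "Czech",
    "da": "Danish", "sv": "Swedish", "el": "Greek",
    "tr": "Turkish",
}


def language_name(code: str) -> str:
    """Return the language name for a code, or raise ValueError if unlisted.

    Hypothetical helper for illustration; not part of any MOSS-TTS API.
    """
    key = code.strip().lower()
    if key not in SUPPORTED_LANGUAGES:
        raise ValueError(f"language code not listed in the table: {code!r}")
    return SUPPORTED_LANGUAGES[key]


if __name__ == "__main__":
    print(language_name("ja"))
```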


FAQ

What is MOSS-TTS best for?

High-fidelity zero-shot voice cloning, stable long-form speech, and multilingual TTS with detailed pronunciation control.

How many parameters does MOSS-TTS have?

8 billion parameters.

What languages does MOSS-TTS support?

20 languages including Chinese, English, German, Spanish, French, Japanese, Korean, Arabic, Russian, Italian, and more.

How can I use MOSS-TTS via API?

Use the gigarouter OpenAI-compatible endpoint with your API key. Refer to the API documentation for request format.
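
As a rough sketch of what such a request might look like once the model is live: OpenAI-compatible speech endpoints conventionally accept a JSON POST to an `/audio/speech` path with `model`, `input`, and `voice` fields. The base URL, model identifier, and voice name below are placeholders assumed for illustration, not confirmed gigarouter values; the API documentation remains the authority on the actual request format.

```python
import json
import urllib.request

# All three values are placeholders (assumptions), not confirmed settings.
BASE_URL = "https://api.example.invalid/v1"  # hypothetical base URL
MODEL_ID = "OpenMOSS-Team/MOSS-TTS"          # assumed to match the name on this page
VOICE = "default"                            # hypothetical voice name


def build_speech_request(text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {"model": MODEL_ID, "input": text, "voice": VOICE}
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_speech_request("Hello from MOSS-TTS.", api_key="YOUR_API_KEY")
    print(req.get_method(), req.full_url)
    # Once a real endpoint exists, the audio bytes could be fetched with:
    #   audio = urllib.request.urlopen(req).read()
```

Building the request separately from sending it makes the payload easy to inspect before the endpoint is available.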

What is the architecture of MOSS-TTS?

It uses an autoregressive discrete-token architecture (MossTTSDelay) built on MOSS-Audio-Tokenizer, which compresses 24 kHz audio to 12.5 frames per second (fps).
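
To make those two figures concrete, the short calculation below works out how many audio samples one tokenizer frame covers and how many frames a clip of a given length needs. It counts frames only; how many discrete tokens each frame carries is not stated on this page, and the function names are illustrative.

```python
import math

SAMPLE_RATE_HZ = 24_000  # audio sample rate stated above
FRAME_RATE_FPS = 12.5    # tokenizer frame rate stated above


def samples_per_frame() -> float:
    """Audio samples summarized by a single tokenizer frame."""
    return SAMPLE_RATE_HZ / FRAME_RATE_FPS


def frames_for_duration(seconds: float) -> int:
    """Frames needed to cover `seconds` of audio, rounded up."""
    return math.ceil(seconds * FRAME_RATE_FPS)


if __name__ == "__main__":
    print(samples_per_frame())        # 1920.0 samples, i.e. 80 ms of audio per frame
    print(frames_for_duration(3600))  # 45000 frames for one hour of speech
```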

not yet live

We're benchmarking and onboarding MOSS-TTS as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
