
MOSS-TTS v1.5

OpenMOSS-Team/MOSS-TTS-v1.5

published May 2026 · updated May 2026

MOSS-TTS v1.5 is a text-to-speech model that performs zero-shot voice cloning, multilingual synthesis, code-switching, and controlled speech generation with explicit pause markers.

est. price: ~$0.0075 (estimated, set at launch)
API providers: 0
downloads / mo: 205.8K
license: apache-2.0

specs

Task: Text-to-Speech (TTS)
Architecture: Autoregressive Transformer with discrete audio tokens (MOSS-Audio-Tokenizer)
Supported Languages: 31 languages

about this model

MOSS-TTS-v1.5 is an autoregressive speech generation model that produces high-fidelity, expressive audio with zero-shot voice cloning, token-level duration control, phoneme- and pinyin-level pronunciation control, multilingual synthesis, code-switching, and stable long-form generation.

Key improvements over MOSS-TTS 1.0

  • Stronger multilingual synthesis with language tags – specifying the language improves quality on all supported languages compared to 1.0.
  • More stable voice cloning – higher speaker similarity and reduced variance across repeated generations.
  • Better long-reference, short-text cloning – handles mismatched reference-to-target length more reliably.
  • More stable punctuation-following prosody – follows punctuation-driven pauses more closely, especially in long sentences.
  • Explicit pause control – supports inline markers such as [pause 3.2s] for precise timing.
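The inline pause markers can be checked on the client side before a script is submitted. The snippet below is a minimal sketch, assuming the marker syntax is exactly "[pause <seconds>s]" as in the [pause 3.2s] example; the helper names (split_pauses, total_pause_seconds) are hypothetical and not part of MOSS-TTS, and the model's own handling of markers may differ.

```python
import re

# Assumed marker syntax, inferred from the "[pause 3.2s]" example:
# "[pause <seconds>s]" with an integer or decimal number of seconds.
PAUSE_RE = re.compile(r"\[pause (\d+(?:\.\d+)?)s\]")


def split_pauses(text):
    """Split a script into ("text", str) and ("pause", seconds) segments."""
    segments = []
    pos = 0
    for match in PAUSE_RE.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append(("text", chunk))
        segments.append(("pause", float(match.group(1))))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


def total_pause_seconds(text):
    """Sum the duration of every pause marker found in the script."""
    return sum(value for kind, value in split_pauses(text) if kind == "pause")
```

For example, split_pauses("Hello. [pause 3.2s] Welcome back.") returns [("text", "Hello."), ("pause", 3.2), ("text", "Welcome back.")], which makes it easy to spot malformed markers or estimate how much silence a script requests.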

Supported languages (31 total)

Language    Code  Language         Code  Language    Code
Chinese     zh    Cantonese        yue   English     en
Arabic      ar    Czech            cs    Danish      da
Dutch       nl    Finnish          fi    French      fr
German      de    Greek            el    Hebrew      he
Hindi       hi    Hungarian        hu    Italian     it
Japanese    ja    Korean           ko    Macedonian  mk
Malay       ms    Persian (Farsi)  fa    Polish      pl
Portuguese  pt    Romanian         ro    Russian     ru
Spanish     es    Swahili          sw    Swedish     sv
Tagalog     tl    Thai             th    Turkish     tr
Vietnamese  vi

The model uses a causal Transformer audio tokenizer (MOSS-Audio-Tokenizer) that compresses 24 kHz audio to 12.5 frames per second with variable-bitrate residual vector quantization (RVQ). It is designed for scalable, real-world deployment; gigarouter is onboarding it as a hosted, OpenAI-compatible API, which is not yet live. For the full 1.0 feature walkthrough and evaluation details, refer to the MOSS-TTS 1.0 README and the technical report (arXiv:2603.18090).
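The tokenizer figures imply a fixed relationship between audio duration and frame count. The arithmetic below is a rough sketch that uses only the 24 kHz sample rate and 12.5 fps frame rate stated above; the function names are illustrative, and the number of RVQ codes per frame (which varies with bitrate) is not modeled.

```python
import math

SAMPLE_RATE_HZ = 24_000  # audio sample rate stated on this page
FRAME_RATE_FPS = 12.5    # tokenizer frame rate stated on this page


def samples_per_frame():
    """Raw audio samples covered by one tokenizer frame."""
    return SAMPLE_RATE_HZ / FRAME_RATE_FPS


def frames_for_duration(seconds):
    """Frames needed to cover a clip, rounding a partial frame up."""
    return math.ceil(seconds * FRAME_RATE_FPS)


def duration_for_frames(n_frames):
    """Seconds of audio represented by n_frames tokenizer frames."""
    return n_frames / FRAME_RATE_FPS
```

Each frame therefore spans 1920 samples (80 ms), and a 10-second clip corresponds to 125 frames. This is why a low frame rate helps long-form generation: the autoregressive model has far fewer steps to produce per second of audio.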

FAQ

What languages does MOSS-TTS v1.5 support?

It supports 31 languages, including Chinese, Cantonese, English, French, German, Japanese, and Korean; the full list with language codes appears in the supported-languages table above.

How does v1.5 improve over v1.0?

v1.5 adds stronger multilingual synthesis with language tags, more stable voice cloning, better long-reference, short-text cloning, more stable punctuation-following prosody, and explicit pause control with inline markers such as [pause 3.2s].

How can I use this model via the API?

Once the model is live, call the gigarouter OpenAI-compatible endpoint with an API key; refer to the hosting service documentation for endpoint details.
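As a rough illustration of what an OpenAI-compatible speech request might look like, the sketch below builds (but does not send) a POST request in the shape of OpenAI's /audio/speech endpoint. Everything specific here is an assumption: the base URL is a placeholder, the model ID is simply the repository name shown on this page, and the voice value is hypothetical. The real endpoint and parameters will come from the hosting service documentation.

```python
import json
import urllib.request

# Placeholder only: the real base URL is not published on this page.
BASE_URL = "https://api.gigarouter.example/v1"
# Assumption: the hosted model ID matches the repository name.
MODEL_ID = "OpenMOSS-Team/MOSS-TTS-v1.5"


def build_speech_request(text, api_key, voice="default", response_format="wav"):
    """Build an OpenAI-style /audio/speech request without sending it."""
    payload = {
        "model": MODEL_ID,
        "input": text,                # may contain markers like [pause 3.2s]
        "voice": voice,               # hypothetical voice name
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Once a real endpoint exists, the returned request could be passed to urllib.request.urlopen and the response body written to an audio file; any extra fields for reference audio or language tags would need to follow the provider's documentation.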

Does this model require a GPU?

The model can run on CPU but is optimized for CUDA GPUs with FlashAttention 2 support for faster inference.

What is the input format?

Input is text, optionally accompanied by reference audio and a language tag; build it with the processor.build_user_message() method.

not yet live

We're benchmarking and onboarding MOSS-TTS v1.5 as a hosted, OpenAI-compatible API.
