MOSS-TTS v1.5

OpenMOSS-Team/MOSS-TTS-v1.5

published May 2026 · updated May 2026

MOSS-TTS v1.5 is a text-to-speech model that performs zero-shot voice cloning, multilingual synthesis, code-switching, and controlled speech generation with explicit pause markers.

est. price

~$0.0075

· estimated, set at launch

API providers

downloads / mo

205.8K

license

apache-2.0

specs

Task	Text-to-Speech (TTS)
Architecture	Autoregressive Transformer with discrete audio tokens (MOSS-Audio-Tokenizer)
Supported Languages	31 languages

about this model

MOSS-TTS-v1.5 is an autoregressive speech generation model that produces high-fidelity, expressive audio with zero-shot voice cloning, token-level duration control, phoneme- and pinyin-level pronunciation control, multilingual synthesis, code-switching, and stable long-form generation.

Key improvements over MOSS-TTS 1.0

·Stronger multilingual synthesis with language tags: specifying the target language improves quality across all supported languages relative to 1.0.
·More stable voice cloning: higher speaker similarity and reduced variance across repeated generations.
·Better long-reference, short-text cloning: handles a mismatch between reference length and target length more reliably.
·More stable punctuation-following prosody: follows punctuation-driven pauses more closely, especially in long sentences.
·Explicit pause control: supports inline markers such as [pause 3.2s] for precise timing.
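
The pause markers can be prepared and checked on the client side before text is sent for synthesis. The sketch below is not part of MOSS-TTS; the helper names are hypothetical, and the marker grammar is inferred from the single `[pause 3.2s]` example above, so the model may accept other forms.

```python
import re

# Matches inline markers such as "[pause 3.2s]". The exact grammar the model
# accepts is an assumption based on the one example given in the text.
PAUSE_RE = re.compile(r"\[pause (\d+(?:\.\d+)?)s\]")


def pause_marker(seconds: float) -> str:
    """Format a pause marker, e.g. 3.2 -> "[pause 3.2s]"."""
    if seconds <= 0:
        raise ValueError("pause length must be positive")
    return f"[pause {seconds:g}s]"


def total_pause_seconds(text: str) -> float:
    """Sum the durations of all pause markers found in the text."""
    return sum(float(m.group(1)) for m in PAUSE_RE.finditer(text))


def strip_pauses(text: str) -> str:
    """Remove pause markers and collapse the whitespace they leave behind."""
    return re.sub(r"\s+", " ", PAUSE_RE.sub(" ", text)).strip()


script = (
    f"Welcome back. {pause_marker(1.5)} Today we cover three topics. "
    f"{pause_marker(0.5)} Let's begin."
)
```

Summing the marker durations gives a lower bound on how much silence the script requests, which helps when budgeting the length of a generated clip.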

Supported languages (31 total)

Language	Code	Language	Code	Language	Code
Chinese	zh	Cantonese	yue	English	en
Arabic	ar	Czech	cs	Danish	da
Dutch	nl	Finnish	fi	French	fr
German	de	Greek	el	Hebrew	he
Hindi	hi	Hungarian	hu	Italian	it
Japanese	ja	Korean	ko	Macedonian	mk
Malay	ms	Persian (Farsi)	fa	Polish	pl
Portuguese	pt	Romanian	ro	Russian	ru
Spanish	es	Swahili	sw	Swedish	sv
Tagalog	tl	Thai	th	Turkish	tr
Vietnamese	vi
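
A client can validate a language code against the table above before building a request. This is a minimal sketch: the dictionary mirrors the table, while `normalize_language` is a hypothetical helper, and the way a code is turned into a language tag in the prompt is not specified here.

```python
# Language codes copied from the table above (31 entries).
SUPPORTED_LANGUAGES = {
    "zh": "Chinese", "yue": "Cantonese", "en": "English",
    "ar": "Arabic", "cs": "Czech", "da": "Danish",
    "nl": "Dutch", "fi": "Finnish", "fr": "French",
    "de": "German", "el": "Greek", "he": "Hebrew",
    "hi": "Hindi", "hu": "Hungarian", "it": "Italian",
    "ja": "Japanese", "ko": "Korean", "mk": "Macedonian",
    "ms": "Malay", "fa": "Persian (Farsi)", "pl": "Polish",
    "pt": "Portuguese", "ro": "Romanian", "ru": "Russian",
    "es": "Spanish", "sw": "Swahili", "sv": "Swedish",
    "tl": "Tagalog", "th": "Thai", "tr": "Turkish",
    "vi": "Vietnamese",
}


def normalize_language(code: str) -> str:
    """Return the lowercase code if it is in the table, else raise ValueError."""
    key = code.strip().lower()
    if key not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {code!r}")
    return key
```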

The model's audio tokenizer (MOSS-Audio-Tokenizer) is a causal Transformer that compresses 24 kHz audio to 12.5 frames per second using variable-bitrate residual vector quantization (RVQ). The model is designed for scalable, real-world deployment, and gigarouter is onboarding it as a hosted, OpenAI-compatible API. For the full 1.0 feature walkthrough and evaluation details, refer to the MOSS-TTS 1.0 README and the technical report (arXiv:2603.18090).
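
The tokenizer figures imply a simple relationship between audio length and frame count. The sketch below works through that arithmetic; the function names are hypothetical, and it counts frames only, since the number of RVQ codes per frame varies with bitrate and is not stated here.

```python
import math

SAMPLE_RATE_HZ = 24_000   # audio sample rate given in the text
FRAME_RATE_FPS = 12.5     # tokenizer frame rate given in the text

# Each tokenizer frame covers 24000 / 12.5 = 1920 audio samples (80 ms).
SAMPLES_PER_FRAME = int(SAMPLE_RATE_HZ / FRAME_RATE_FPS)


def frames_for_duration(seconds: float) -> int:
    """Number of frames needed to cover the given duration, rounded up."""
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    return math.ceil(seconds * FRAME_RATE_FPS)


def duration_for_frames(frames: int) -> float:
    """Audio duration in seconds represented by a number of frames."""
    return frames / FRAME_RATE_FPS
```

At this frame rate, one minute of speech corresponds to 750 frames, which gives a rough sense of sequence length for long-form generation.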

best for

·Zero-shot voice cloning from a short reference audio
·Long-form multilingual speech generation with code-switching
·Explicit pause control for precise timing in synthetic speech

FAQ

What languages does MOSS-TTS v1.5 support?

It supports 31 languages, including Chinese, Cantonese, English, French, German, Japanese, and Korean. The full list with language codes is in the table above.

How does v1.5 improve over v1.0?

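v1.5 brings stronger multilingual synthesis with language tags, more stable voice cloning, more reliable long-reference, short-text cloning, more stable punctuation-following prosody, and explicit pause control through inline markers such as [pause 3.2s].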

How can I use this model via the API?

Once the model is live, call the gigarouter OpenAI-compatible endpoint with an API key; refer to the hosting service documentation for endpoint details.
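
As a rough illustration of what an OpenAI-compatible speech request looks like, the sketch below builds (but does not send) an HTTP request with the Python standard library. Everything specific in it is an assumption: the base URL is a placeholder, the model identifier is taken from the repository name, and the `/audio/speech` path and the `model`, `input`, and `voice` fields follow the OpenAI speech API shape, which the hosted endpoint may or may not mirror.

```python
import json
import urllib.request

# Placeholder values, not confirmed by the hosting service documentation.
BASE_URL = "https://api.example.com/v1"
MODEL_ID = "OpenMOSS-Team/MOSS-TTS-v1.5"


def build_speech_request(text, api_key, voice="default", base_url=BASE_URL):
    """Build a POST request in the shape of the OpenAI speech API."""
    payload = {"model": MODEL_ID, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_speech_request("Hello. [pause 1s] Goodbye.", api_key="YOUR_API_KEY")

# Sending the request requires a real endpoint and key:
# with urllib.request.urlopen(request) as response:
#     audio_bytes = response.read()
```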

Does this model require a GPU?

The model can run on CPU but is optimized for CUDA GPUs with FlashAttention 2 support for faster inference.
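
For self-hosted inference, the choice between CPU and GPU can be made at startup. The sketch below is one way to pick settings. `pick_runtime` and `detect_runtime` are hypothetical helpers, the bfloat16 choice is an assumption, and `attn_implementation` is the Hugging Face Transformers loading option, which the MOSS-TTS loading code is assumed (not confirmed) to honor.

```python
import importlib.util


def pick_runtime(cuda_available: bool, flash_attn_installed: bool) -> dict:
    """Choose device, dtype, and attention backend from two capability flags."""
    if cuda_available:
        # Prefer FlashAttention 2 on CUDA; otherwise fall back to PyTorch SDPA.
        attn = "flash_attention_2" if flash_attn_installed else "sdpa"
        return {"device": "cuda", "dtype": "bfloat16", "attn_implementation": attn}
    # CPU path: full precision and the default SDPA attention.
    return {"device": "cpu", "dtype": "float32", "attn_implementation": "sdpa"}


def detect_runtime() -> dict:
    """Probe the local environment and return the settings to use."""
    try:
        import torch  # optional dependency; absent means CPU-only settings
        cuda = torch.cuda.is_available()
    except ImportError:
        cuda = False
    flash = importlib.util.find_spec("flash_attn") is not None
    return pick_runtime(cuda, flash)
```

Keeping the decision in a pure function makes the fallback behavior easy to check without a GPU present.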

What is the input format?

Input is text with optional reference audio and language tag; use the processor.build_user_message() method.

not yet live

We're benchmarking and onboarding MOSS-TTS v1.5 as a hosted, OpenAI-compatible API.
