Qwen3-TTS 12Hz 0.6B CustomVoice

Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice

published Jan 2026 · updated Jan 2026

Qwen3-TTS 12Hz 0.6B CustomVoice is a multilingual text-to-speech model that supports custom voice generation with fine-grained style control across 10 languages.

est. price

~$0.0075 (estimated, set at launch)

API providers

downloads / mo

1.2M

license

apache-2.0

specs

Task	Text-to-Speech (TTS)
Architecture	Dual-track LM with 12.5 Hz multi-codebook tokenizer
Parameters	0.6 billion
License	Apache 2.0

about this model

Qwen3-TTS-12Hz-0.6B-CustomVoice is a multilingual text-to-speech model that generates speech with fine-grained style control through natural-language instructions. It supports voice cloning from a three-second reference and streaming output with first-packet latency as low as 97 ms.

The model is trained on over 5 million hours of speech data covering 10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian. It uses a dual-track language model architecture with a 12.5 Hz, 16-layer multi-codebook tokenizer (Qwen-TTS-Tokenizer-12Hz) and a lightweight causal ConvNet for real-time streaming synthesis.
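
To make the tokenizer figures concrete, the short calculation below works out the frame duration and code count implied by a 12.5 Hz, 16-layer tokenizer. It is a sketch, not a description of the model's internals: it assumes each of the 16 layers contributes exactly one code per frame, and the function names are illustrative.

```python
import math

FRAME_RATE_HZ = 12.5  # tokenizer frame rate, from the description above
NUM_CODEBOOKS = 16    # assumption: one code per codebook layer per frame


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Length of audio covered by one tokenizer frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def codes_for_duration(seconds: float,
                       frame_rate_hz: float = FRAME_RATE_HZ,
                       num_codebooks: int = NUM_CODEBOOKS) -> tuple:
    """Return (frames, total codes) needed to cover `seconds` of audio."""
    frames = math.ceil(seconds * frame_rate_hz)
    return frames, frames * num_codebooks


print(frame_duration_ms())      # 80.0
print(codes_for_duration(3.0))  # (38, 608)
```

Under these assumptions, one frame covers 80 ms of audio and a three-second clip maps to 38 frames.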

Speaker Profiles

The custom voice variant includes nine built-in timbres, each with a recommended native language:

Speaker	Voice Description	Native Language
Vivian	Bright young female voice	Chinese
Serena	Warm, gentle young female voice	Chinese
Uncle_Fu	Seasoned male voice, mellow timbre	Chinese
Dylan	Youthful Beijing male voice	Chinese (Beijing)
Eric	Lively Chengdu male voice	Chinese (Sichuan)
Ryan	Dynamic male voice with rhythm	English
Aiden	Sunny American male voice	English
Ono_Anna	Playful Japanese female voice	Japanese
Sohee	Warm Korean female voice	Korean
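
As an illustration of how the table above might be used in application code, the sketch below stores the speaker names with their native languages and filters them by language. The names and languages come from the table; the `SPEAKERS` mapping and the `speakers_for` helper are hypothetical and not part of any official package.

```python
# Built-in speakers and their recommended native languages, from the table above.
SPEAKERS = {
    "Vivian": "Chinese",
    "Serena": "Chinese",
    "Uncle_Fu": "Chinese",
    "Dylan": "Chinese (Beijing)",
    "Eric": "Chinese (Sichuan)",
    "Ryan": "English",
    "Aiden": "English",
    "Ono_Anna": "Japanese",
    "Sohee": "Korean",
}


def speakers_for(language: str) -> list:
    """Return speakers whose native language starts with `language`.

    Prefix matching means "Chinese" also matches the regional
    entries such as "Chinese (Beijing)".
    """
    wanted = language.strip().lower()
    return [name for name, native in SPEAKERS.items()
            if native.lower().startswith(wanted)]


print(speakers_for("English"))  # ['Ryan', 'Aiden']
print(speakers_for("Chinese"))  # ['Vivian', 'Serena', 'Uncle_Fu', 'Dylan', 'Eric']
```

Languages without a native built-in speaker, such as German, return an empty list; the model card lists them as supported, so any of the nine timbres could presumably still be used for them.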

Key Capabilities

Style control: Adapts tone, rhythm, and emotional expression via prompts such as “Speak in a very happy tone.”
Voice cloning: Clone a target voice from a three-second reference, or design entirely novel voices through description.
Streaming output: End-to-end latency as low as 97 ms enables real-time speech generation.
License: Apache 2.0, covering both tokenizer and model weights.

best for

· Custom voice generation with natural-language style control
· Multilingual TTS across 10 languages, including Chinese, English, and Japanese
· Low-latency streaming speech synthesis

FAQ

What is the end-to-end latency of this model?

End-to-end synthesis latency is as low as 97 ms.
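
To check a latency figure like this in your own setup, one option is to time how long a streaming response takes to deliver its first audio chunk. The sketch below is a generic measurement helper, not part of any Qwen or gigarouter API: `first_chunk_latency_ms` accepts any iterable of byte chunks, and `fake_stream` is a stand-in that simulates a 50 ms delay.

```python
import time
from typing import Iterable, Iterator, Tuple


def first_chunk_latency_ms(chunks: Iterable[bytes]) -> Tuple[float, bytes]:
    """Time how long the stream takes to yield its first chunk."""
    start = time.perf_counter()
    try:
        first = next(iter(chunks))
    except StopIteration:
        raise ValueError("stream produced no audio chunks") from None
    return (time.perf_counter() - start) * 1000.0, first


def fake_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response."""
    time.sleep(delay_s)  # simulated synthesis time before the first packet
    yield b"\x00" * 320
    yield b"\x00" * 320


latency, chunk = first_chunk_latency_ms(fake_stream())
print(f"first chunk after {latency:.1f} ms ({len(chunk)} bytes)")
```

Measured values will include network and client overhead, so they are not directly comparable to the model-side figure quoted above.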

Which languages are supported?

Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian.

How do I call this model via the gigarouter API?

The model is not yet live. Once it is, use the OpenAI-compatible endpoint with your API key; refer to the gigarouter documentation for details.
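
As a rough sketch of what such a request could look like, the snippet below builds a JSON payload using the field names of OpenAI's speech endpoint (`model`, `input`, `voice`, `instructions`, `response_format`). Whether gigarouter accepts these fields for this model is an assumption, and the `build_speech_request` helper is hypothetical. The snippet only constructs the payload and does not send anything.

```python
import json

MODEL_ID = "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice"

# Built-in speaker names from the Speaker Profiles table.
BUILT_IN_SPEAKERS = {
    "Vivian", "Serena", "Uncle_Fu", "Dylan", "Eric",
    "Ryan", "Aiden", "Ono_Anna", "Sohee",
}


def build_speech_request(text, speaker="Vivian", instruct=None,
                         response_format="wav"):
    """Build an OpenAI-style speech request body (assumed field mapping)."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if speaker not in BUILT_IN_SPEAKERS:
        raise ValueError(f"unknown speaker: {speaker}")
    payload = {
        "model": MODEL_ID,
        "input": text,
        "voice": speaker,
        "response_format": response_format,
    }
    if instruct:
        # Assumption: style prompts map to the OpenAI `instructions` field.
        payload["instructions"] = instruct
    return payload


request_body = build_speech_request(
    "Hello there.",
    speaker="Ryan",
    instruct="Speak in a very happy tone.",
)
print(json.dumps(request_body, indent=2))
```

With an OpenAI-compatible service, a body like this is typically sent as a POST to the `/v1/audio/speech` path with an `Authorization: Bearer` header; the actual base URL and supported fields should be taken from the gigarouter documentation.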

What is the license for this model?

Apache 2.0.

Can I clone a voice with this model?

Yes. It supports voice cloning from a three-second reference, voice design from a text description, and a design-then-clone workflow.

not yet live

We're benchmarking and onboarding Qwen3-TTS 12Hz 0.6B CustomVoice as a hosted, OpenAI-compatible API.

related text-to-speech models

XTTS-v2

9.3M dl/mo

Qwen3-TTS-12Hz-1.7B-CustomVoice