Qwen3-TTS 12Hz 0.6B CustomVoice
Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice
published Jan 2026 · updated Jan 2026
Qwen3-TTS 12Hz 0.6B CustomVoice is a multilingual text-to-speech model that supports custom voice generation with fine-grained style control across 10 languages.
specs
| Task | Text-to-Speech (TTS) |
| Architecture | Dual-track LM with 12.5 Hz multi-codebook tokenizer |
| Parameters | 0.6 billion |
| License | Apache 2.0 |
about this model
Qwen3-TTS-12Hz-0.6B-CustomVoice is a multilingual text-to-speech model that generates speech with fine-grained style control through natural-language instructions. It supports voice cloning from a three-second reference clip and streaming output with a first-packet latency of 97 ms.
The model is trained on over 5 million hours of speech data covering 10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian. It uses a dual-track language model architecture with a 12.5 Hz, 16-layer multi-codebook tokenizer (Qwen-TTS-Tokenizer-12Hz) and a lightweight causal ConvNet for real-time streaming synthesis.
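To make the tokenizer figures concrete, the sketch below works out the frame duration and a rough code count for a clip of a given length. It assumes that each of the 16 codebook layers emits exactly one code per 12.5 Hz frame; the model card does not state this explicitly, and the function names are illustrative only.

```python
import math
from typing import Tuple

FRAME_RATE_HZ = 12.5   # tokenizer frame rate stated in the model card
CODEBOOK_LAYERS = 16   # "16-layer multi-codebook" from the model card


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Duration of audio covered by one tokenizer frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def token_budget(
    seconds: float,
    frame_rate_hz: float = FRAME_RATE_HZ,
    layers: int = CODEBOOK_LAYERS,
) -> Tuple[int, int]:
    """Return (frames, total codes) needed to cover `seconds` of audio.

    Assumption: one code per layer per frame, so total = frames * layers.
    """
    frames = math.ceil(seconds * frame_rate_hz)
    return frames, frames * layers


if __name__ == "__main__":
    print(frame_duration_ms())   # 80.0 ms per frame
    print(token_budget(3.0))     # (38, 608) for a three-second clip
```

Under these assumptions, one frame spans 80 ms of audio, and a three-second reference clip corresponds to 38 frames.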
Speaker Profiles
The custom voice variant includes nine built-in timbres, each with a recommended native language:
| Speaker | Voice Description | Native Language |
|---|---|---|
| Vivian | Bright young female voice | Chinese |
| Serena | Warm, gentle young female voice | Chinese |
| Uncle_Fu | Seasoned male voice, mellow timbre | Chinese |
| Dylan | Youthful Beijing male voice | Chinese (Beijing) |
| Eric | Lively Chengdu male voice | Chinese (Sichuan) |
| Ryan | Dynamic male voice with rhythm | English |
| Aiden | Sunny American male voice | English |
| Ono_Anna | Playful Japanese female voice | Japanese |
| Sohee | Warm Korean female voice | Korean |
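The table above can be kept as a small lookup so that an application picks a timbre whose recommended native language matches the input text. This is a minimal sketch: the speaker names and languages are copied from the table, while the `speakers_for` helper is hypothetical, and whether an API accepts these exact strings as voice identifiers is an assumption.

```python
from typing import Dict, List

# Speaker name -> recommended native language, copied from the table above.
SPEAKERS: Dict[str, str] = {
    "Vivian": "Chinese",
    "Serena": "Chinese",
    "Uncle_Fu": "Chinese",
    "Dylan": "Chinese (Beijing)",
    "Eric": "Chinese (Sichuan)",
    "Ryan": "English",
    "Aiden": "English",
    "Ono_Anna": "Japanese",
    "Sohee": "Korean",
}


def speakers_for(language: str) -> List[str]:
    """List built-in speakers whose native language starts with `language`.

    Prefix matching lets "Chinese" also return the Beijing and Sichuan voices.
    """
    needle = language.strip().lower()
    return [name for name, native in SPEAKERS.items()
            if native.lower().startswith(needle)]


if __name__ == "__main__":
    print(speakers_for("English"))  # ['Ryan', 'Aiden']
    print(speakers_for("German"))   # [] (no built-in native German timbre)
```

Languages without a native built-in timbre, such as German or French, return an empty list here; a caller would then fall back to any speaker, since the model card lists all 10 languages as supported.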
Key Capabilities
- Style control: Adapts tone, rhythm, and emotional expression via prompts such as “Speak in a very happy tone.”
- Voice cloning: Clone a target voice from a three-second reference, or design entirely novel voices through description.
- Streaming output: End-to-end latency as low as 97 ms enables real-time speech generation.
- License: Apache 2.0, covering both tokenizer and model weights.
best for
- Custom voice generation with natural language style control
- Multilingual TTS for 10 languages including Chinese, English, Japanese, and more
- Low-latency streaming speech synthesis
FAQ
End-to-end synthesis latency is as low as 97 ms.
Supported languages are Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian.
To call the model, use the OpenAI-compatible endpoint with your API key; refer to the gigarouter documentation for details.
The model is licensed under Apache 2.0.
Voice cloning from a 3-second reference is supported, as are voice design and voice design-then-clone.
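As a rough illustration of what a call to an OpenAI-compatible speech endpoint might look like, the sketch below builds a `POST /audio/speech` request using only the Python standard library. Everything specific here is an assumption: the base URL is a placeholder, the route and the `model`, `input`, `voice`, and `instructions` fields follow the OpenAI speech API shape, and the gigarouter documentation remains the authority on what the hosted endpoint actually accepts.

```python
import json
import urllib.request
from typing import Optional

# Placeholder base URL (hypothetical); substitute the one from the provider docs.
BASE_URL = "https://api.example.invalid/v1"
MODEL_ID = "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice"


def build_speech_request(
    text: str,
    voice: str,
    api_key: str,
    instructions: Optional[str] = None,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {"model": MODEL_ID, "input": text, "voice": voice}
    if instructions:
        # Assumed field for style prompts such as "Speak in a very happy tone."
        payload["instructions"] = instructions
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(request: urllib.request.Request, out_path: str) -> None:
    """Send the request and write the returned audio bytes to `out_path`."""
    with urllib.request.urlopen(request) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```

A call would then look like `synthesize(build_speech_request("Hello!", "Ryan", api_key), "hello.wav")`, where the output container format depends on what the endpoint returns.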
We're benchmarking and onboarding Qwen3-TTS 12Hz 0.6B CustomVoice as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.