Qwen3 TTS VoiceDesign 1.7B
Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
published Jan 2026 · updated Jan 2026
Qwen3 TTS VoiceDesign 1.7B is a text-to-speech model that supports voice design, voice cloning, and natural language-based voice control for multilingual speech generation.
specs
| Spec | Value |
|---|---|
| Task | Text-to-Speech (TTS) |
| Architecture | Dual-track LM with discrete multi-codebook 12Hz tokenizer |
| Parameters | 1.7B |
| License | Apache 2.0 |
about this model
Qwen3-TTS-12Hz-1.7B-VoiceDesign is a text-to-speech model that generates natural, multilingual speech from text input, supporting voice design, voice cloning, and natural-language-based voice control. It is part of the Qwen3-TTS family, trained on over 5 million hours of speech data covering 10 languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian) and multiple dialectal voice profiles.
Architecture and Key Strengths
The model uses a dual-track LM architecture with the Qwen3-TTS-Tokenizer-12Hz, a 12.5 Hz, 16-layer multi-codebook tokenizer paired with a lightweight causal ConvNet for streaming reconstruction. This design enables extremely low-latency streaming, with a first-packet latency of 97 ms. The model supports both streaming and non-streaming generation and allows fine-grained control over timbre, emotion, and prosody via natural language instructions. It also offers state-of-the-art voice cloning from just 3 seconds of reference audio.
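To make the tokenizer figures concrete, here is a small back-of-the-envelope sketch. It takes the 12.5 Hz frame rate and 16 codebook layers from the description above and assumes, as is typical for multi-codebook tokenizers but not confirmed here, that each layer contributes one discrete code per frame. The function names are illustrative only.

```python
import math

# Figures quoted in the description above.
FRAME_RATE_HZ = 12.5   # tokenizer frames per second of audio
NUM_CODEBOOKS = 16     # codebook layers per frame


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Milliseconds of audio covered by a single tokenizer frame."""
    return 1000.0 / frame_rate_hz


def codes_for_duration(
    seconds: float,
    frame_rate_hz: float = FRAME_RATE_HZ,
    num_codebooks: int = NUM_CODEBOOKS,
) -> tuple[int, int]:
    """Return (frames, total discrete codes) needed to cover `seconds` of audio.

    Assumption: one code per codebook layer per frame.
    """
    frames = math.ceil(seconds * frame_rate_hz)
    return frames, frames * num_codebooks


if __name__ == "__main__":
    print(frame_duration_ms())        # 80.0 ms per frame
    print(codes_for_duration(10.0))   # (125, 2000) for 10 s of audio
    print(codes_for_duration(3.0))    # (38, 608) for a 3 s reference clip
```

Under these assumptions one frame spans 80 ms, so ten seconds of speech corresponds to 125 frames, or 2,000 discrete codes across all layers.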
Benchmark Performance
Zero-shot speech generation evaluation on the Seed-TTS test set yields the following word error rates (WER, lower is better):
| Model | test-zh | test-en |
|---|---|---|
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | 0.77 | 1.24 |
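
For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate can be computed: the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. It is a generic illustration, not the Seed-TTS evaluation pipeline, and it skips steps such as text normalization. It returns a fraction; that the table's figures are percentages is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

In a TTS evaluation, the hypothesis would typically come from running a speech recognizer on the synthesized audio, so the score reflects how intelligible the generated speech is.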

The model is released under the Apache 2.0 license.
best for
- Creating custom voices from natural language descriptions
- 3-second voice cloning for personalized speech
- Real-time streaming TTS with ultra-low latency
FAQ
The model supports voice design, voice cloning, and controllable emotional and multilingual speech generation with low-latency streaming.
This 1.7B VoiceDesign variant focuses on voice creation via natural language instructions, while other variants support custom voice or base generation.
The model is released under the Apache 2.0 license, allowing free use, modification, and distribution.
Input is text (with optional language, speaker, and instruction) via the API; output is a WAV audio waveform.
Use the gigarouter OpenAI-compatible endpoint with your API key, passing the model name and required parameters.
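As a rough illustration of what such a call might look like, the sketch below builds a speech request in the style of OpenAI's `/audio/speech` endpoint using only the Python standard library. The base URL is a placeholder, and the path and field names (`input`, `instructions`, `language`, `response_format`) are assumptions borrowed from the OpenAI convention and the input description above; none of them are confirmed for this endpoint, so check the provider's documentation before relying on them.

```python
import json
import urllib.request
from typing import Optional

# Placeholder: substitute the real base URL from the provider's documentation.
BASE_URL = "https://api.example.com/v1"
MODEL = "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign"


def build_speech_request(
    text: str,
    api_key: str,
    instruction: Optional[str] = None,
    language: Optional[str] = None,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for speech synthesis.

    The path and field names are assumed, not taken from official docs.
    """
    payload = {"model": MODEL, "input": text, "response_format": "wav"}
    if instruction is not None:
        payload["instructions"] = instruction  # natural-language voice description
    if language is not None:
        payload["language"] = language
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(request: urllib.request.Request, out_path: str) -> None:
    """Send the request and write the returned audio bytes to `out_path`."""
    with urllib.request.urlopen(request) as response:
        audio_bytes = response.read()
    with open(out_path, "wb") as out_file:
        out_file.write(audio_bytes)
```

A call might then look like `synthesize_to_file(build_speech_request("Hello there.", api_key, instruction="A warm, calm narrator"), "hello.wav")`. Keeping request construction separate from sending makes the payload easy to inspect before any network traffic occurs.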
We're benchmarking and onboarding Qwen3 TTS VoiceDesign 1.7B as a hosted, OpenAI-compatible API.