Qwen3 TTS VoiceDesign 1.7B

Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign

published Jan 2026 · updated Jan 2026

Qwen3 TTS VoiceDesign 1.7B is a text-to-speech model that supports voice design, voice cloning, and natural language-based voice control for multilingual speech generation.

est. price

~$0.0075

· estimated, set at launch

API providers

downloads / mo

657.8K

license

apache-2.0

specs

Task	Text-to-Speech (TTS)
Architecture	Dual-track LM with discrete multi-codebook 12.5 Hz tokenizer (Qwen3-TTS-Tokenizer-12Hz)
Parameters	1.7B
License	Apache 2.0

about this model

Qwen3-TTS-12Hz-1.7B-VoiceDesign is a text-to-speech model that generates natural, multilingual speech from text input, supporting voice design, voice cloning, and natural-language-based voice control. It is part of the Qwen3-TTS family, trained on over 5 million hours of speech data covering 10 languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian) and multiple dialectal voice profiles.

Architecture and Key Strengths

The model uses a dual-track LM architecture built on Qwen3-TTS-Tokenizer-12Hz, a 12.5 Hz, 16-layer multi-codebook tokenizer paired with a lightweight causal ConvNet for streaming reconstruction. This design enables extremely low-latency streaming, with a first-packet latency of 97 ms. The model supports both streaming and non-streaming generation, and it allows fine-grained control over timbre, emotion, and prosody through natural language instructions. It also offers state-of-the-art voice cloning from as little as 3 seconds of reference audio.
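As a rough illustration of what a 12.5 Hz tokenizer implies, the sketch below works out the frame duration and an approximate token count for a clip. It assumes that "16-layer" means 16 codebook tokens per frame, which is an interpretation rather than something stated above, and the function names are hypothetical.

```python
import math

FRAME_RATE_HZ = 12.5  # tokenizer frame rate quoted above
NUM_CODEBOOKS = 16    # assumption: "16-layer" = 16 codebook tokens per frame


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Duration of audio covered by one tokenizer frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def tokens_for(seconds: float,
               frame_rate_hz: float = FRAME_RATE_HZ,
               num_codebooks: int = NUM_CODEBOOKS) -> int:
    """Approximate number of discrete tokens needed for `seconds` of audio."""
    frames = math.ceil(seconds * frame_rate_hz)  # partial frames round up
    return frames * num_codebooks


print(frame_duration_ms())  # 80.0
print(tokens_for(2.0))      # 400
```

Under these assumptions each frame covers 80 ms of audio, so two seconds of speech corresponds to 25 frames, or 400 codebook tokens.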

Benchmark Performance

Zero-shot speech generation evaluation on the Seed-TTS test set yields the following word error rates (WER, lower is better):

Model	test-zh	test-en
Qwen3-TTS-12Hz-1.7B-VoiceDesign	0.77	1.24
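For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate is generally computed: the word-level edit distance between a reference text and a transcript, divided by the number of reference words. This is not the benchmark's own scoring pipeline; the function name is hypothetical, and whitespace tokenization is an assumption that suits English better than Chinese, where characters are usually compared instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(100 * wer, 2))  # 33.33
```

In TTS evaluation the hypothesis is typically a speech-recognition transcript of the synthesized audio, so a lower value means the generated speech was transcribed back more faithfully.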

[Figure: Qwen3-TTS architecture diagram]

The model is released under the Apache 2.0 license.

best for

· Creating custom voices from natural language descriptions
· 3-second voice cloning for personalized speech
· Real-time streaming TTS with ultra-low latency

FAQ

What is the model best for?

Voice design, voice cloning, and controllable, emotionally expressive multilingual speech generation with low-latency streaming.

How does it compare to other Qwen3 TTS models?

This 1.7B VoiceDesign variant focuses on voice creation via natural language instructions, while other variants support custom voice or base generation.

What are the license terms?

The model is released under the Apache 2.0 license, allowing free use, modification, and distribution.

What are the input and output formats?

Input is text (with optional language, speaker, and instruction) via the API; output is a WAV audio waveform.
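Since the output is WAV audio, a quick sanity check on a response might look like the sketch below, which reads basic properties with Python's standard `wave` module. The helper names are hypothetical, and the silent clip and its 24000 Hz sample rate are placeholders standing in for a real response, not documented properties of the model.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Return basic properties of PCM WAV bytes using only the stdlib."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / rate,
        }


def make_silence(seconds: float, rate: int = 24000) -> bytes:
    """Build a silent mono 16-bit WAV as a stand-in for real audio bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


info = describe_wav(make_silence(0.5))
print(info["sample_rate_hz"], info["duration_s"])  # 24000 0.5
```

With a real response, the returned bytes would be passed to `describe_wav` in place of the synthetic clip.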

How can I call this model via the API?

Once the model is live, call the gigarouter OpenAI-compatible endpoint with your API key, passing the model name and the required parameters.
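The hosted API is not yet live, so the following is only a sketch of what a request could look like, assuming the endpoint mirrors the shape of OpenAI's speech API (`/audio/speech` with `model`, `input`, `instructions`, and `response_format`). The base URL is a placeholder, and the `language` field name and the helper function are hypothetical; check the provider's documentation before relying on any of them.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com/v1"  # placeholder, not a real endpoint
MODEL = "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign"


def build_speech_request(text, instruction, api_key, language=None):
    """Build (but do not send) a POST request for speech synthesis."""
    payload = {
        "model": MODEL,
        "input": text,
        "instructions": instruction,  # natural-language voice description
        "response_format": "wav",
    }
    if language is not None:
        payload["language"] = language  # assumed field name
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_speech_request(
    "Hello there.", "A calm, low-pitched narrator", "YOUR_API_KEY"
)
print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` and reading the response body would then yield the audio bytes, provided the real endpoint accepts this shape.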

not yet live

Qwen3 TTS VoiceDesign 1.7B is still being benchmarked and onboarded as a hosted, OpenAI-compatible API.

related text-to-speech models

XTTS-v2

9.3M dl/mo

Qwen3-TTS-12Hz-1.7B-CustomVoice

2M dl/mo

Qwen3-TTS-12Hz-0.6B-CustomVoice