
Qwen3 TTS 12Hz 1.7B CustomVoice

Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice

published Jan 2026 · updated Jan 2026

Qwen3 TTS 12Hz 1.7B CustomVoice is a multilingual text-to-speech model that generates speech with style control over 9 premium timbres via user instructions, supporting streaming and low-latency synthesis.

est. price: ~$0.0075 (estimated, set at launch)
API providers: 0
downloads / mo: 2M
license: apache-2.0

specs

Task: Text-to-Speech (TTS)
Architecture: Discrete multi-codebook language model with dual-track hybrid streaming
Parameters: 1.7B
License: Apache 2.0

about this model

Qwen3-TTS-12Hz-1.7B-CustomVoice is a text-to-speech model that generates natural speech with instruction-driven control over timbre, emotion, and prosody across 10 languages, using a discrete multi-codebook LM architecture for end-to-end speech modeling. It is built on the Qwen3-TTS-Tokenizer-12Hz (12.5 Hz frame rate, 16-layer multi-codebook), which compresses speech to a very low bitrate and enables streaming with a first-packet latency as low as 97 ms. The model supports Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian, and offers nine premium built-in timbres covering varied gender, age, and dialectal profiles.
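To make the tokenizer figures concrete, here is a rough back-of-the-envelope sketch of what a 12.5 Hz, 16-codebook token stream implies. The frame rate and codebook count come from the description above; the codebook size (ASSUMED_CODEBOOK_SIZE) is not stated on this page and is a placeholder assumption, so the bitrate it yields is illustrative only.

```python
import math

# Figures taken from the model description above.
FRAME_RATE_HZ = 12.5   # tokenizer frames per second
NUM_CODEBOOKS = 16     # codebook layers per frame

# ASSUMPTION: entries per codebook. Not stated on this page; placeholder only.
ASSUMED_CODEBOOK_SIZE = 2048


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Audio time covered by one tokenizer frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def tokens_per_second(frame_rate_hz: float = FRAME_RATE_HZ,
                      num_codebooks: int = NUM_CODEBOOKS) -> float:
    """Discrete tokens emitted per second of audio (all codebooks)."""
    return frame_rate_hz * num_codebooks


def frames_for(duration_s: float,
               frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of frames needed to cover `duration_s` seconds of audio."""
    return math.ceil(duration_s * frame_rate_hz)


def bitrate_bps(codebook_size: int = ASSUMED_CODEBOOK_SIZE) -> float:
    """Token-stream bitrate in bits/s under the assumed codebook size."""
    return tokens_per_second() * math.log2(codebook_size)


if __name__ == "__main__":
    print(f"frame duration: {frame_duration_ms()} ms")
    print(f"tokens per second: {tokens_per_second()}")
    print(f"frames for 10 s: {frames_for(10)}")
    print(f"bitrate (assumed codebook size): {bitrate_bps()} bit/s")
```

Each frame covers 80 ms of audio, so a 97 ms first-packet latency is on the order of a single frame plus processing overhead.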

Key capabilities

  • Instruction-controlled voice: Natural language instructions adjust tone, speaking rate, and emotional expression (e.g., “用特别愤怒的语气说”, “say it in a very angry tone”).
  • Streaming generation: The dual-track hybrid architecture outputs audio immediately after a single character is input, with end-to-end latency as low as 97 ms.
  • Multilingual & dialectal support: Native-quality output for each built-in speaker’s language; each speaker can also speak any supported language.

Built-in timbres

Speaker | Description | Native language
Vivian | Bright, slightly edgy young female voice | Chinese
Serena | Warm, gentle young female voice | Chinese
Uncle_Fu | Seasoned male voice, low, mellow timbre | Chinese
Dylan | Youthful Beijing male voice, clear, natural | Chinese (Beijing Dialect)
Eric | Lively Chengdu male voice, slightly husky brightness | Chinese (Sichuan Dialect)
Ryan | Dynamic male voice, strong rhythmic drive | English
Aiden | Sunny American male voice, clear midrange | English
Ono_Anna | Playful Japanese female voice, light nimble timbre | Japanese
Sohee | Warm Korean female voice, rich emotion | Korean
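Since output quality is described as native-level in each speaker's own language, it can help to pick a speaker by target language. The sketch below encodes the table above as a plain lookup; the helper name speakers_for is hypothetical and not part of any Qwen or gigarouter API.

```python
from typing import Dict, List

# Speaker name -> native language, copied from the built-in timbres table.
BUILT_IN_SPEAKERS: Dict[str, str] = {
    "Vivian": "Chinese",
    "Serena": "Chinese",
    "Uncle_Fu": "Chinese",
    "Dylan": "Chinese (Beijing Dialect)",
    "Eric": "Chinese (Sichuan Dialect)",
    "Ryan": "English",
    "Aiden": "English",
    "Ono_Anna": "Japanese",
    "Sohee": "Korean",
}


def speakers_for(language: str) -> List[str]:
    """Return speakers whose native language starts with `language`.

    Matching is case-insensitive, so "chinese" also returns the
    Beijing and Sichuan dialect speakers.
    """
    wanted = language.strip().lower()
    return [
        name
        for name, native in BUILT_IN_SPEAKERS.items()
        if native.lower().startswith(wanted)
    ]


if __name__ == "__main__":
    print(speakers_for("English"))
    print(speakers_for("Chinese"))
```

Languages without a native built-in speaker (for example German) return an empty list; per the capabilities above, any speaker can still be used for them.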

Architecture & training

[Figure: Qwen3-TTS architecture diagram showing dual-track LM and tokenizer components]

The model was trained on over 5 million hours of speech data. The checkpoint contains 1,916,676,352 parameters (nominally 1.7B, per the model name), with weights stored in BF16 format. It is licensed under Apache 2.0.
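As a rough sizing aid, the parameter count and BF16 format above give a lower bound on the memory needed just to hold the weights. This is simple arithmetic, not a measured figure: it ignores activations, caches, and any runtime overhead.

```python
# Figures taken from the paragraph above.
PARAM_COUNT = 1_916_676_352
BYTES_PER_BF16 = 2  # bfloat16 stores each value in 16 bits


def weight_bytes(param_count: int = PARAM_COUNT,
                 bytes_per_param: int = BYTES_PER_BF16) -> int:
    """Bytes needed to store the weights alone."""
    return param_count * bytes_per_param


def to_gib(num_bytes: int) -> float:
    """Convert bytes to binary gigabytes (GiB)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    total = weight_bytes()
    print(f"{total} bytes, about {to_gib(total):.2f} GiB of weights")
```

That works out to roughly 3.6 GiB for the weights alone, before any inference-time memory.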

Benchmark performance

The model is reported to achieve state-of-the-art results across a range of objective and subjective evaluations, including the TTS multilingual test set, InstructTTSEval, and the long speech test set (see the technical report).


FAQ

What languages does this model support?

It supports Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian.

How do I control the speaking style or emotion?

Pass an optional instruct string when calling generate_custom_voice, e.g. "Very happy." or "用特别愤怒的语气说" ("say it in a very angry tone").
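A minimal sketch of how such a call might be assembled is shown below. Only the method name generate_custom_voice and the instruct argument come from this page; the other keyword names (text, speaker, language), the helper names, and the shape of the model object are assumptions for illustration, so check the model's own documentation before relying on them.

```python
from typing import Any, Dict, Optional

# Languages listed on this page.
SUPPORTED_LANGUAGES = {
    "Chinese", "English", "Japanese", "Korean", "German",
    "French", "Russian", "Portuguese", "Spanish", "Italian",
}


def build_custom_voice_kwargs(text: str,
                              speaker: str,
                              language: str,
                              instruct: Optional[str] = None) -> Dict[str, Any]:
    """Validate inputs and build keyword arguments for the call.

    ASSUMPTION: the keyword names `text`, `speaker`, and `language`
    are hypothetical; only `instruct` is named on this page.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    kwargs: Dict[str, Any] = {
        "text": text,
        "speaker": speaker,
        "language": language,
    }
    if instruct:  # the instruction is optional, so omit it when unset
        kwargs["instruct"] = instruct
    return kwargs


def synthesize(model: Any, **request: Any) -> Any:
    """Call `generate_custom_voice` on any object that provides it."""
    return model.generate_custom_voice(**build_custom_voice_kwargs(**request))
```

Usage would look like synthesize(model, text="Hello", speaker="Ryan", language="English", instruct="Very happy."), where model is an already loaded instance; omitting instruct leaves the style at the speaker's default.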

What is the first-packet latency for streaming?

The model can emit the first audio packet in as little as 97 ms after a single character of input.
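To check first-packet latency in your own setup, one option is to time how long a streaming iterator takes to yield its first chunk. The helper below is a generic sketch that works with any Python iterable of audio chunks; it does not assume a particular streaming API, and the name time_to_first_chunk is hypothetical.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def time_to_first_chunk(stream: Iterable[T]) -> Tuple[float, T, Iterator[T]]:
    """Measure the delay until `stream` yields its first item.

    Returns (elapsed_ms, first_chunk, remaining_iterator) so the caller
    can keep consuming the rest of the audio after measuring.
    Raises StopIteration if the stream is empty.
    """
    iterator = iter(stream)
    start = time.perf_counter()
    first = next(iterator)  # blocks until the first chunk arrives
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, first, iterator
```

The measurement only covers work that happens once iteration begins, so with a lazy generator it includes request setup, while with an already started stream it may understate the true latency. Network and hardware differences mean measured values can be well above the 97 ms best case.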

Is the model available as a hosted API on gigarouter?

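Not yet. The model is still being benchmarked and onboarded, and it currently has no API providers. Once it is live, you will be able to call it via the OpenAI-compatible endpoint using an API key on gigarouter.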

What is the license for this model?

It is released under the Apache 2.0 license.

not yet live

We're benchmarking and onboarding Qwen3 TTS 12Hz 1.7B CustomVoice as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
