Qwen3-TTS 12Hz 0.6B CustomVoice

Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice

published Jan 2026 · updated Jan 2026

Qwen3-TTS 12Hz 0.6B CustomVoice is a multilingual text-to-speech model that supports custom voice generation with fine-grained style control across 10 languages.

est. price: ~$0.0075 (estimated, set at launch)
API providers: 0
downloads / mo: 1.2M
license: apache-2.0

specs

Task: Text-to-Speech (TTS)
Architecture: Dual-track LM with 12.5 Hz multi-codebook tokenizer
Parameters: 0.6 billion
License: Apache 2.0

about this model

Qwen3-TTS-12Hz-0.6B-CustomVoice generates speech with fine-grained style control through natural-language instructions. It supports voice cloning from a three-second reference and streaming output with 97 ms first-packet latency.

The model is trained on over 5 million hours of speech data covering 10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian. It uses a dual-track language model architecture with a 12.5 Hz, 16-layer multi-codebook tokenizer (Qwen-TTS-Tokenizer-12Hz) and a lightweight causal ConvNet for real-time streaming synthesis.
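To make the tokenizer figures concrete, the arithmetic can be sketched as follows. This is a rough illustration, not the model's implementation: it assumes each of the 16 codebook layers emits exactly one code per frame, which this page does not state, and the function names are illustrative.

```python
import math

# Figures quoted on this page for Qwen-TTS-Tokenizer-12Hz.
FRAME_RATE_HZ = 12.5
CODEBOOK_LAYERS = 16  # assumption: one code per layer per frame


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Milliseconds of audio covered by a single tokenizer frame."""
    return 1000.0 / frame_rate_hz


def codes_per_second(frame_rate_hz: float = FRAME_RATE_HZ,
                     layers: int = CODEBOOK_LAYERS) -> float:
    """Discrete codes per second of audio, under the one-code-per-layer assumption."""
    return frame_rate_hz * layers


def frames_for(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Frames needed to cover `seconds` of audio, rounding up to a whole frame."""
    return math.ceil(seconds * frame_rate_hz)


print(frame_duration_ms())   # 80.0
print(codes_per_second())    # 200.0
print(frames_for(3.0))       # 38
```

Under these assumptions, one frame spans 80 ms of audio, and a three-second reference clip corresponds to 38 frames.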

Speaker Profiles

The custom voice variant includes nine built-in timbres, each with a recommended native language:

Speaker  | Voice Description                | Native Language
Vivian   | Bright young female voice        | Chinese
Serena   | Warm, gentle young female voice  | Chinese
Uncle_Fu | Seasoned male voice, mellow timbre | Chinese
Dylan    | Youthful Beijing male voice      | Chinese (Beijing)
Eric     | Lively Chengdu male voice        | Chinese (Sichuan)
Ryan     | Dynamic male voice with rhythm   | English
Aiden    | Sunny American male voice        | English
Ono_Anna | Playful Japanese female voice    | Japanese
Sohee    | Warm Korean female voice         | Korean
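For client code that needs to pick a built-in timbre by language, the speaker table can be held as a small lookup. This is a minimal sketch: the `Speaker` type and `speakers_for` helper are hypothetical names, not part of any Qwen or gigarouter API, and matching on the start of the language string is an illustrative choice so that dialect entries such as "Chinese (Beijing)" count as Chinese.

```python
from typing import List, NamedTuple


class Speaker(NamedTuple):
    name: str
    description: str
    native_language: str


# Data copied from the speaker table on this page.
SPEAKERS = [
    Speaker("Vivian", "Bright young female voice", "Chinese"),
    Speaker("Serena", "Warm, gentle young female voice", "Chinese"),
    Speaker("Uncle_Fu", "Seasoned male voice, mellow timbre", "Chinese"),
    Speaker("Dylan", "Youthful Beijing male voice", "Chinese (Beijing)"),
    Speaker("Eric", "Lively Chengdu male voice", "Chinese (Sichuan)"),
    Speaker("Ryan", "Dynamic male voice with rhythm", "English"),
    Speaker("Aiden", "Sunny American male voice", "English"),
    Speaker("Ono_Anna", "Playful Japanese female voice", "Japanese"),
    Speaker("Sohee", "Warm Korean female voice", "Korean"),
]


def speakers_for(language: str) -> List[str]:
    """Names of built-in speakers whose native language starts with `language`."""
    wanted = language.strip().lower()
    return [s.name for s in SPEAKERS
            if s.native_language.lower().startswith(wanted)]


print(speakers_for("English"))  # ['Ryan', 'Aiden']
print(speakers_for("German"))   # []
```

An empty result, as for German here, means no built-in timbre lists that language as native, even though the model supports it.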

Key Capabilities

  • Style control: Adapts tone, rhythm, and emotional expression via prompts such as “Speak in a very happy tone.”
  • Voice cloning: Clone a target voice from a three-second reference, or design entirely novel voices through description.
  • Streaming output: End-to-end latency as low as 97 ms enables real-time speech generation.
  • License: Apache 2.0, covering both tokenizer and model weights.

FAQ

What is the end-to-end latency of this model?

End-to-end synthesis latency is as low as 97 ms.

Which languages are supported?

Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian.

How do I call this model via the gigarouter API?

Once the model is live, call it through gigarouter's OpenAI-compatible endpoint with your API key; see the gigarouter documentation for details.
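Since the model is not yet hosted, the following is only a sketch of what an OpenAI-style speech request might look like. The base URL is a hypothetical placeholder, and the `/audio/speech` path and the `model`, `input`, `voice`, and `instructions` fields follow OpenAI's speech endpoint; whether gigarouter accepts the same path and fields is an assumption. The function builds the request without sending it.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical placeholder: this page does not give the real gigarouter base URL.
BASE_URL = "https://api.gigarouter.example/v1"
MODEL_ID = "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice"


def build_speech_request(text: str, voice: str, api_key: str,
                         instructions: Optional[str] = None,
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {"model": MODEL_ID, "input": text, "voice": voice}
    if instructions:
        # Assumed field for natural-language style control.
        payload["instructions"] = instructions
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

A caller would pass the result to `urllib.request.urlopen` and write the response bytes to an audio file, for example with `voice="Ryan"` and `instructions="Speak in a very happy tone."`. Check the gigarouter documentation for the actual URL and parameter names before relying on this.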

What is the license for this model?

Apache 2.0.

Can I clone a voice with this model?

Yes. It supports cloning a voice from a three-second reference, designing a new voice from a description, and designing a voice and then cloning it.

not yet live

We're benchmarking and onboarding Qwen3-TTS 12Hz 0.6B CustomVoice as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
