VieNeu-TTS v3 Turbo
pnnbao-ump/VieNeu-TTS-v3-Turbo
published Jun 2026 · updated Jun 2026
VieNeu-TTS v3 Turbo is a Vietnamese TTS model that generates 48 kHz high-fidelity speech with instant voice cloning, built-in multi-speaker default voices, and experimental emotion cues.
specs
| Spec | Value |
|---|---|
| Task | Text-to-Speech (TTS) |
| Architecture | Original design by Phạm Nguyễn Ngọc Bảo, trained from scratch; uses MOSS-Audio-Tokenizer-Nano codec |
| Parameters | Not specified |
| License | Apache License 2.0 |
about this model
VieNeu-TTS-v3-Turbo is a text-to-speech model that generates 48 kHz high-fidelity Vietnamese and bilingual English–Vietnamese speech with instant voice cloning, built-in multi-speaker default voices, and experimental emotion cues.
The model is an original architecture designed and trained from scratch by Phạm Nguyễn Ngọc Bảo on approximately 10,000 hours of English–Vietnamese speech. It uses the MOSS-Audio-Tokenizer-Nano neural audio codec and the sea-g2p grapheme-to-phoneme converter.
Key capabilities
- 48 kHz output: a substantial fidelity increase over the previous 24 kHz v2.
- Built-in default voices: ten preset voices (male and female) addressed by dedicated speaker tokens; no reference clip required for these voices.
- Instant voice cloning: clones a voice from a 3–5 second reference audio clip.
- Emotion and non-verbal cues (experimental): supports the inline tags [cười] (laugh), [thở dài] (sigh), and [hắng giọng] (clear throat).
- Batched generation: synthesises multiple chunks in one pass, batch size up to 32, including multi-speaker conversation mode.
- Bilingual code-switching: seamless transitions between Vietnamese and English within a single utterance.
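To make the batched generation capability concrete, here is a minimal Python sketch of how long text could be split into chunks and grouped into batches of at most 32. It is not the model's own chunking logic, which the page does not describe: the function names, the sentence-splitting regex, and the 200-character chunk limit are all assumptions chosen for illustration. Only the batch size ceiling of 32 comes from the list above.

```python
import re
from typing import Iterator, List

MAX_BATCH_SIZE = 32  # batch size ceiling stated in the capability list


def split_into_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    max_chars is an illustrative assumption, not a documented limit.
    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?…])\s+", text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk that is already full
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def batched(chunks: List[str], batch_size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive groups of chunks, each no larger than batch_size."""
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must be between 1 and {MAX_BATCH_SIZE}")
    for start in range(0, len(chunks), batch_size):
        yield chunks[start:start + batch_size]
```

Each yielded batch would then be passed to one synthesis call. Splitting on sentence boundaries keeps prosody intact within a chunk, at the cost of uneven chunk lengths.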
Default voices
| Voice | Gender | Style |
|---|---|---|
| Ngọc Lan (default) | Female | Soft / gentle |
| Ngọc Linh | Female | Bright |
| Trúc Ly | Female | Youthful |
| Mỹ Duyên | Female | Smooth |
| Xuân Vĩnh | Male | Upbeat |
| Thái Sơn | Male | Firm |
| Gia Bảo | Male | Smooth |
| Đức Trí | Male | Clear |
| Trọng Hữu | Male | Knowledgeable |
| Bình An | Male | Even / calm |
For any other voice, voice cloning with a short reference clip is used.
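The choice between a preset voice and cloning can be sketched as a small helper. The sketch below is hypothetical and uses no SDK call: `choose_voice` and `wav_duration_seconds` are illustrative names, the voice table is copied from the list above, and rejecting clips outside 3–5 seconds is an assumption about how strictly the stated clip length should be enforced. Only the Python standard library is used.

```python
import io
import unicodedata
import wave
from typing import Optional

# Preset voices copied from the table above: name -> (gender, style).
_VOICES = {
    "Ngọc Lan": ("Female", "Soft / gentle"),
    "Ngọc Linh": ("Female", "Bright"),
    "Trúc Ly": ("Female", "Youthful"),
    "Mỹ Duyên": ("Female", "Smooth"),
    "Xuân Vĩnh": ("Male", "Upbeat"),
    "Thái Sơn": ("Male", "Firm"),
    "Gia Bảo": ("Male", "Smooth"),
    "Đức Trí": ("Male", "Clear"),
    "Trọng Hữu": ("Male", "Knowledgeable"),
    "Bình An": ("Male", "Even / calm"),
}
# Normalise keys so composed and decomposed diacritics compare equal.
DEFAULT_VOICES = {unicodedata.normalize("NFC", k): v for k, v in _VOICES.items()}
DEFAULT_VOICE = unicodedata.normalize("NFC", "Ngọc Lan")


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of an uncompressed WAV clip held in memory."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def choose_voice(voice: Optional[str] = None, ref_wav: Optional[bytes] = None) -> dict:
    """Pick cloning when a reference clip is given, otherwise a preset voice."""
    if ref_wav is not None:
        duration = wav_duration_seconds(ref_wav)
        # Assumption: treat the 3-5 second guidance as a hard bound.
        if not 3.0 <= duration <= 5.0:
            raise ValueError(f"reference clip is {duration:.1f}s; expected 3-5s")
        return {"mode": "clone", "ref_duration_s": duration}
    name = unicodedata.normalize("NFC", voice) if voice else DEFAULT_VOICE
    if name not in DEFAULT_VOICES:
        raise ValueError(f"unknown preset voice {name!r}; supply a reference clip instead")
    gender, style = DEFAULT_VOICES[name]
    return {"mode": "preset", "voice": name, "gender": gender, "style": style}
```

Calling `choose_voice()` with no arguments falls back to Ngọc Lan, the voice marked as default in the table. Normalising names to NFC avoids lookup failures when the same Vietnamese name arrives with decomposed diacritics.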
The model is distributed under the Apache License 2.0. The recommended temperature is 0.8 for stable results; higher values add expressiveness but may reduce stability.
best for
- Vietnamese text-to-speech with high-fidelity 48 kHz output
- Instant voice cloning from a 3–5 second audio clip
- Multi-speaker conversation generation with batched scripts
- Bilingual English–Vietnamese code-switching TTS
FAQ
VieNeu-TTS v3 Turbo generates 48 kHz high-fidelity speech, a significant upgrade from the 24 kHz of v2.
To clone a voice, provide a 3–5 second reference audio clip via the ref_audio parameter in the SDK or API; no fine-tuning is needed.
There are 10 built-in voices (e.g., Ngọc Lan, Xuân Vĩnh); select one by name via the voice parameter, with no reference audio required.
The model is distributed under the Apache License 2.0; attribution must be kept for the original project and this Hugging Face package.
To call the model, use the gigarouter OpenAI-compatible endpoint with your API key, passing the model name and input text as parameters.
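As a rough illustration of what an OpenAI-compatible call looks like, the Python sketch below builds (but does not send) a request in the shape of OpenAI's `/audio/speech` endpoint, which takes `model`, `input`, and `voice` in a JSON body. The base URL is a placeholder because the page does not give the gigarouter address, and using the repository name as the model identifier is an assumption. `build_speech_request` is a hypothetical helper, not part of any SDK.

```python
import json
import urllib.request

# Placeholder: the real gigarouter base URL is not given on this page.
BASE_URL = "https://api.example.invalid/v1"
# Assumption: the hosted model is addressed by its repository name.
MODEL_ID = "pnnbao-ump/VieNeu-TTS-v3-Turbo"


def build_speech_request(
    text: str,
    api_key: str,
    voice: str = "Ngọc Lan",
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build a POST request in the shape of an OpenAI-style speech call."""
    payload = {"model": MODEL_ID, "input": text, "voice": voice}
    # ensure_ascii=False keeps Vietnamese text readable in the JSON body.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it, and on an OpenAI-style endpoint the response body would be the audio bytes. Check the provider's documentation for the actual base URL, model identifier, and supported fields before relying on this shape.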
We're benchmarking and onboarding VieNeu-TTS v3 Turbo as a hosted, OpenAI-compatible API.