XTTS V2
coqui/XTTS-v2
published Oct 2023 · updated Dec 2023
XTTS V2 is a text-to-speech model that supports voice cloning from a short audio clip and multilingual speech generation.
specs
| Task | Text-to-Speech |
| Architecture | Tortoise-based |
| Languages | 17 |
| Sampling Rate | 24 kHz |
| License | Coqui Public Model License |
about this model
Coqui XTTS-v2 is a text-to-speech model that generates high-quality speech, clones a voice from a short audio sample, and can carry that cloned voice across languages.
The model can clone a voice from a reference clip as short as 6 seconds (parts of the documentation cite a 3-second sample) and can transfer the reference speaker's emotion and style. It supports 17 languages: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese (zh-cn), Japanese, Hungarian, Korean, and Hindi. Speech is generated at a 24 kHz sampling rate.
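Because the quality of a clone depends on the reference clip, it can help to check the clip's length before using it. The sketch below uses only the Python standard library and assumes an uncompressed PCM WAV file; the helper names and the 6-second threshold constant are illustrative and are not part of any Coqui API.

```python
import wave

# Guideline taken from the 6-second figure quoted above (an assumption, not a hard limit).
MIN_REFERENCE_SECONDS = 6.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path: str, minimum: float = MIN_REFERENCE_SECONDS) -> bool:
    """Report whether a reference clip meets the minimum duration."""
    return wav_duration_seconds(path) >= minimum
```

At a 24 kHz sampling rate, a 6-second mono clip corresponds to 6 × 24,000 = 144,000 frames.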
Key improvements over XTTS-v1 include architectural enhancements for speaker conditioning, support for multiple speaker references and speaker interpolation, stability improvements, and better prosody and audio quality. Two new languages (Hungarian and Korean) were added.
XTTS-v2 supports streaming inference with latency under 200 ms. The model is built on the Tortoise architecture. Example fine-tuning recipes are available (for instance, for LJSpeech), and the underlying TTS library offers over 1,100 pretrained models.
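One way to check the streaming latency on your own hardware is to time how long each audio chunk takes to arrive. The sketch below is a minimal example, assuming the Coqui TTS package is installed and a local XTTS-v2 checkpoint directory exists; `checkpoint_dir` and `reference_wav` are placeholder paths, and `timed_chunks` and `stream_xtts` are hypothetical helper names introduced here.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def timed_chunks(chunks: Iterable[T]) -> Iterator[Tuple[float, T]]:
    """Yield (seconds_since_first_request, chunk) for each streamed chunk."""
    start = time.perf_counter()
    for chunk in chunks:
        yield time.perf_counter() - start, chunk


def stream_xtts(checkpoint_dir: str, reference_wav: str, text: str, language: str = "en"):
    """Load XTTS-v2 from a local checkpoint and return a generator of audio chunks."""
    # Imported lazily so timed_chunks works even without the TTS package installed.
    from TTS.tts.configs.xtts_config import XttsConfig
    from TTS.tts.models.xtts import Xtts

    config = XttsConfig()
    config.load_json(f"{checkpoint_dir}/config.json")
    model = Xtts.init_from_config(config)
    model.load_checkpoint(config, checkpoint_dir=checkpoint_dir)

    # Compute speaker conditioning once from the reference clip.
    gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
        audio_path=[reference_wav]
    )
    return model.inference_stream(text, language, gpt_cond_latent, speaker_embedding)
```

Wrapping the stream as `timed_chunks(stream_xtts(...))` starts the timer only when the first chunk is requested, so the first elapsed value approximates time to first audio and excludes model loading. Actual latency will depend on hardware and configuration.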
For inference, built-in speakers (such as “Ana Florence”) are provided, and the model can accept one or multiple reference WAV files without runtime penalty.
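The two ways of choosing a voice (a built-in speaker or one or more reference WAV files) can be sketched with the Coqui `TTS` Python API. This is a minimal sketch, not the only way to run the model: `build_tts_kwargs` and `synthesize` are hypothetical helpers, the file paths are placeholders, and the two-letter language codes are an assumption based on the 17 languages listed above (only `zh-cn` is stated explicitly in the text).

```python
from typing import Optional, Sequence, Union

MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"

# Assumed codes for the 17 supported languages; only "zh-cn" appears in the text above.
LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
}


def build_tts_kwargs(
    text: str,
    language: str,
    speaker_wav: Optional[Union[str, Sequence[str]]] = None,
    speaker: Optional[str] = None,
) -> dict:
    """Validate inputs and build keyword arguments for TTS.tts_to_file."""
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    if (speaker_wav is None) == (speaker is None):
        raise ValueError("pass exactly one of speaker_wav or speaker")
    kwargs = {"text": text, "language": language}
    if speaker_wav is not None:
        # Accept a single path or several reference clips.
        kwargs["speaker_wav"] = (
            [speaker_wav] if isinstance(speaker_wav, str) else list(speaker_wav)
        )
    else:
        kwargs["speaker"] = speaker
    return kwargs


def synthesize(file_path: str, text: str, language: str, **voice) -> None:
    """Generate speech to file_path; requires the Coqui TTS package."""
    from TTS.api import TTS  # lazy import: downloads the model on first use

    tts = TTS(MODEL_NAME)
    tts.tts_to_file(file_path=file_path, **build_tts_kwargs(text, language, **voice))
```

For example, `synthesize("out.wav", "Hello!", "en", speaker="Ana Florence")` would use the built-in speaker, while `synthesize("out.wav", "Bonjour !", "fr", speaker_wav=["ref_1.wav", "ref_2.wav"])` would clone from two reference clips.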
Additional Resources
The model is licensed under the Coqui Public Model License (CPML). The broader TTS library is available under MPL-2.0.
best for
- Cloning a voice from a 6-second audio sample to generate speech in multiple languages
- Real-time speech generation with low latency for interactive applications
- Creating multilingual voiceovers for videos or podcasts
FAQ
A 6-second audio clip is sufficient for voice cloning.
XTTS V2 supports 17 languages: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Hungarian, Korean, and Hindi.
Output audio is sampled at 24 kHz.
Yes, XTTS V2 can stream with less than 200 ms latency.
To call the model through an API, use the gigarouter OpenAI-compatible endpoint with an API key.
We're benchmarking and onboarding XTTS V2 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
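Once a hosted endpoint exists, a request would presumably follow the OpenAI text-to-speech shape: a POST to `/audio/speech` with `model`, `input`, and `voice` fields. The sketch below only builds the request and does not send it. The base URL is a placeholder, and the model identifier `coqui/XTTS-v2` and the voice name are assumptions; check the provider's documentation for the real values.

```python
import json
import urllib.request


def build_speech_request(
    base_url: str,
    api_key: str,
    text: str,
    voice: str,
    model: str = "coqui/XTTS-v2",  # assumed identifier, taken from the repo name
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen(request)` would return the audio bytes if the endpoint behaves like OpenAI's speech API.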