XTTS V2

coqui/XTTS-v2

published Oct 2023 · updated Dec 2023

XTTS V2 is a text-to-speech model that supports voice cloning from a short audio clip and multilingual speech generation.

status

coming soon

API providers

downloads / mo

9.3M

license

other

specs

Task	Text-to-Speech
Architecture	Tortoise-based
Languages	17
Sampling Rate	24 kHz
License	Coqui Public Model License

about this model

Coqui XTTS-v2 is a text-to-speech model that generates high-quality speech with voice cloning from a short audio sample, supporting cross-language cloning across multiple languages.

The model can clone a voice from a reference clip as short as 6 seconds (parts of the documentation cite 3 seconds) and can transfer the reference speaker's emotion and style. It supports 17 languages: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese (zh-cn), Japanese, Hungarian, Korean, and Hindi. Speech is generated at a 24 kHz sampling rate.
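As a rough illustration of the constraints above, the sketch below checks a request before synthesis: the language must be one of the 17 supported ones, and the reference clip should be at least 6 seconds long. Only the code zh-cn appears in the text above; the other codes are assumed to be the usual two-letter ISO 639-1 codes, and the helper names are hypothetical.

```python
import wave

# Codes for the 17 supported languages. Only "zh-cn" is given in the text;
# the remaining codes are assumed to be standard ISO 639-1 codes.
XTTS_V2_LANGUAGES = {
    "en": "English", "es": "Spanish", "fr": "French", "de": "German",
    "it": "Italian", "pt": "Portuguese", "pl": "Polish", "tr": "Turkish",
    "ru": "Russian", "nl": "Dutch", "cs": "Czech", "ar": "Arabic",
    "zh-cn": "Chinese", "ja": "Japanese", "hu": "Hungarian",
    "ko": "Korean", "hi": "Hindi",
}

# The more conservative of the two figures quoted above (6 s vs. 3 s).
MIN_REFERENCE_SECONDS = 6.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_request(language, reference_wav, min_seconds=MIN_REFERENCE_SECONDS):
    """Validate a cloning request and return the reference clip duration.

    Raises ValueError for an unsupported language code or a clip that is
    shorter than min_seconds.
    """
    if language not in XTTS_V2_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    duration = wav_duration_seconds(reference_wav)
    if duration < min_seconds:
        raise ValueError(
            f"reference clip is {duration:.1f} s; need at least {min_seconds:.1f} s"
        )
    return duration
```

Passing min_seconds=3.0 relaxes the check to the shorter figure mentioned in the documentation.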

Key improvements over XTTS-v1 include architectural enhancements for speaker conditioning, support for multiple speaker references and speaker interpolation, stability improvements, and better prosody and audio quality. Two new languages (Hungarian and Korean) were added.

XTTS-v2 can stream with less than 200 ms latency. It is built on the Tortoise architecture. Example fine-tuning recipes are available (for instance, for LJSpeech), and the underlying TTS library offers over 1,100 pretrained models.
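One way to check the streaming latency figure on your own hardware is to time how long a streaming generator takes to yield its first audio chunk. The helper below is a generic sketch and is not part of the TTS library: time_to_first_chunk is a hypothetical name, and it works with any lazy iterable of chunks.

```python
import time


def time_to_first_chunk(chunks, clock=time.perf_counter):
    """Return (latency_seconds, first_chunk) for an iterable of audio chunks.

    The iterable should be lazy (for example a generator), so that synthesis
    only starts when the first chunk is requested. Raises StopIteration if
    the iterable yields nothing.
    """
    start = clock()
    first = next(iter(chunks))
    return clock() - start, first
```

Comparing the returned latency against 0.2 seconds gives a quick check; results will depend on the GPU, text length, and chunk size in use.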

For inference, built-in speakers (such as “Ana Florence”) are provided, and the model can accept one or multiple reference WAV files without runtime penalty.
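A minimal sketch of both inference modes, assuming the Coqui TTS Python package is installed and that the model is loaded under the name tts_models/multilingual/multi-dataset/xtts_v2. The two helper functions are hypothetical wrappers written for this example; the library is imported lazily so the argument builder can be used on its own.

```python
def build_tts_kwargs(text, language, out_path, speaker=None, speaker_wav=None):
    """Assemble keyword arguments for a tts_to_file call.

    Exactly one of `speaker` (a built-in speaker name) or `speaker_wav`
    (one path or a list of paths to reference WAV files) must be given.
    """
    if (speaker is None) == (speaker_wav is None):
        raise ValueError("pass exactly one of speaker or speaker_wav")
    kwargs = {"text": text, "language": language, "file_path": out_path}
    if speaker is not None:
        kwargs["speaker"] = speaker
    else:
        kwargs["speaker_wav"] = speaker_wav
    return kwargs


def synthesize(**kwargs):
    """Run XTTS-v2 and write the audio to kwargs['file_path']."""
    # Imported here so build_tts_kwargs works without the package installed.
    from TTS.api import TTS

    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(**kwargs)
    return kwargs["file_path"]


# Built-in speaker (not executed here; loading the model downloads weights):
#   synthesize(**build_tts_kwargs("Hello!", "en", "out.wav",
#                                 speaker="Ana Florence"))
# Cloning from several reference clips (hypothetical file names):
#   synthesize(**build_tts_kwargs("Bonjour !", "fr", "out_fr.wav",
#                                 speaker_wav=["ref1.wav", "ref2.wav"]))
```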

Additional Resources

The model is licensed under the Coqui Public Model License (CPML). The broader TTS library is available under MPL-2.0.

best for

·Cloning a voice from a 6-second audio sample to generate speech in multiple languages
·Real-time speech generation with low latency for interactive applications
·Creating multilingual voiceovers for videos or podcasts

FAQ

What is the minimum audio length required for voice cloning?

A 6-second audio clip is sufficient; parts of the documentation cite as little as 3 seconds.

How many languages does XTTS V2 support?

It supports 17 languages: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Hungarian, Korean, and Hindi.

What is the output sampling rate?

24 kHz.

Can I use the model for streaming?

Yes, XTTS V2 can stream with less than 200 ms latency.

How do I access the model via API?

The hosted API is not yet live. Once it is, use the gigarouter OpenAI-compatible endpoint with an API key.
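Since the hosted endpoint has not been published, the following is only a sketch of what an OpenAI-style speech request might look like. Everything specific is an assumption: the base URL is a placeholder, the /audio/speech path and the model, input, and voice fields follow the OpenAI audio API convention, and the model id is taken from the repository name at the top of this page. The request is built but not sent.

```python
import json
import urllib.request

# Hypothetical placeholder; the real base URL is not given on this page.
BASE_URL = "https://api.example.invalid/v1"


def build_speech_request(api_key, text, voice,
                         model="coqui/XTTS-v2", base_url=BASE_URL):
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    # Field names assume the OpenAI audio API shape; the hosted service
    # may differ once it is live.
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Once the endpoint is available, passing the request to urllib.request.urlopen and writing the response body to a file would complete the call.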

not yet live

We're benchmarking and onboarding XTTS V2 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related text-to-speech models

Qwen3-TTS-12Hz-1.7B-CustomVoice

2M dl/mo

Qwen3-TTS-12Hz-0.6B-CustomVoice