gigarouter

XTTS V2

coqui/XTTS-v2

published Oct 2023 · updated Dec 2023

XTTS V2 is a text-to-speech model that supports voice cloning from a short audio clip and multilingual speech generation.

status: coming soon
API providers: 0
downloads / mo: 9.3M
license: other

specs

Task: Text-to-Speech
Architecture: Tortoise-based
Languages: 17
Sampling Rate: 24 kHz
License: Coqui Public Model License

about this model

Coqui XTTS-v2 is a text-to-speech model that generates high-quality speech, clones a voice from a short audio sample, and can carry that cloned voice across languages.

Voice cloning works from a reference clip as short as 6 seconds (the documentation also cites a 3-second sample), and the model can transfer emotion and style from the reference speaker. It supports 17 languages: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese (zh-cn), Japanese, Hungarian, Korean, and Hindi. Speech is generated at a 24 kHz sampling rate.
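The language list and sampling rate above can be captured in a small lookup, for instance to validate a request before synthesis. This is a minimal sketch: only `zh-cn` is spelled out on this page, so the other codes are assumed to be the usual ISO 639-1 two-letter codes, and the helper names are hypothetical.

```python
# Languages listed on this page. Only "zh-cn" is given explicitly here;
# the remaining codes are ASSUMED to be standard ISO 639-1 codes.
XTTS_V2_LANGUAGES = {
    "en": "English", "es": "Spanish", "fr": "French", "de": "German",
    "it": "Italian", "pt": "Portuguese", "pl": "Polish", "tr": "Turkish",
    "ru": "Russian", "nl": "Dutch", "cs": "Czech", "ar": "Arabic",
    "zh-cn": "Chinese", "ja": "Japanese", "hu": "Hungarian",
    "ko": "Korean", "hi": "Hindi",
}

SAMPLE_RATE_HZ = 24_000  # output sampling rate stated on this page


def check_language(code: str) -> str:
    """Return the normalized code, or raise if it is not in the list."""
    normalized = code.strip().lower()
    if normalized not in XTTS_V2_LANGUAGES:
        raise ValueError(f"unsupported language code: {code!r}")
    return normalized


def duration_seconds(num_samples: int) -> float:
    """Length in seconds of a mono clip with num_samples samples at 24 kHz."""
    return num_samples / SAMPLE_RATE_HZ
```

At this rate, six seconds of mono audio corresponds to 144,000 samples.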

Key improvements over XTTS-v1 include architectural enhancements for speaker conditioning, support for multiple speaker references and speaker interpolation, stability improvements, and better prosody and audio quality. Two new languages (Hungarian and Korean) were added.
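This page does not say how speaker interpolation is implemented. One common way to blend two voices, shown here purely as an illustration, is a linear mix of their speaker embedding vectors. The function name and the toy vectors are hypothetical and are not XTTS code.

```python
from typing import Sequence


def interpolate_speakers(a: Sequence[float], b: Sequence[float], weight: float) -> list:
    """Linearly blend two embeddings; weight=0 returns a, weight=1 returns b.

    Illustrative assumption only: the real model may combine speakers differently.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return [(1.0 - weight) * x + weight * y for x, y in zip(a, b)]


# Toy two-dimensional "embeddings"; real ones would be much longer.
blended = interpolate_speakers([0.0, 2.0], [2.0, 4.0], weight=0.5)
```

Sweeping `weight` from 0 to 1 would move the blended vector from the first speaker toward the second.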

XTTS-v2 can stream with less than 200 ms latency. The model is built on the Tortoise architecture. Example fine-tuning recipes are available (for example, for LJSpeech), and the underlying TTS library offers over 1,100 pretrained models.
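A latency figure like the one above is usually read as time to first audio chunk. The sketch below shows one way to measure that for any chunk iterator. It uses a stand-in generator rather than the real model, so `fake_stream` and its timing are assumptions for illustration only.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_chunk(chunks: Iterable[bytes]) -> Tuple[float, Iterator[bytes]]:
    """Return (seconds until the first chunk arrived, iterator over all chunks)."""
    iterator = iter(chunks)
    start = time.perf_counter()
    first = next(iterator)  # blocks until the stream produces audio
    elapsed = time.perf_counter() - start

    def replay() -> Iterator[bytes]:
        # Hand back the consumed first chunk, then the rest of the stream.
        yield first
        yield from iterator

    return elapsed, replay()


def fake_stream() -> Iterator[bytes]:
    """Stand-in for a streaming synthesizer (hypothetical, not XTTS)."""
    for _ in range(3):
        time.sleep(0.01)
        yield b"\x00\x00" * 240  # 10 ms of 16-bit mono silence at 24 kHz


latency, audio = time_to_first_chunk(fake_stream())
total_bytes = sum(len(chunk) for chunk in audio)
print(f"first chunk after {latency * 1000:.1f} ms, {total_bytes} bytes total")
```

With a real streaming backend, the same helper could wrap the model's chunk generator. An empty stream would raise `StopIteration` at the `next` call, which a production version should handle.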

For inference, built-in speakers (such as “Ana Florence”) are provided, and the model can accept one or multiple reference WAV files without runtime penalty.
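The two inference modes described above (a built-in speaker, or one or more reference WAV files) might be driven from the Coqui TTS Python package roughly as follows. This is a sketch under assumptions: the model name string and the `tts_to_file` argument names follow the Coqui TTS documentation as best recalled and should be checked against the installed version, while `build_tts_kwargs` and `synthesize` are hypothetical helpers.

```python
from typing import Optional, Sequence, Union

# Assumed model identifier in the Coqui TTS model catalogue.
MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"


def build_tts_kwargs(
    text: str,
    language: str,
    file_path: str,
    speaker: Optional[str] = None,
    speaker_wav: Optional[Union[str, Sequence[str]]] = None,
) -> dict:
    """Assemble keyword arguments for exactly one of the two speaker modes."""
    if (speaker is None) == (speaker_wav is None):
        raise ValueError("pass exactly one of speaker or speaker_wav")
    kwargs = {"text": text, "language": language, "file_path": file_path}
    if speaker is not None:
        kwargs["speaker"] = speaker  # built-in voice, e.g. "Ana Florence"
    elif isinstance(speaker_wav, str):
        kwargs["speaker_wav"] = speaker_wav  # single reference clip
    else:
        kwargs["speaker_wav"] = list(speaker_wav)  # several reference clips
    return kwargs


def synthesize(**kwargs) -> None:
    """Run synthesis; needs the Coqui TTS package and downloads the model."""
    from TTS.api import TTS  # imported lazily so the helper above has no dependency

    tts = TTS(MODEL_NAME)
    tts.tts_to_file(**kwargs)


builtin = build_tts_kwargs("Hello.", "en", "out.wav", speaker="Ana Florence")
cloned = build_tts_kwargs("Hello.", "en", "out.wav", speaker_wav=["a.wav", "b.wav"])
```

Calling `synthesize(**builtin)` or `synthesize(**cloned)` would then write `out.wav`. The file names here are placeholders.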

Additional Resources

The model is licensed under the Coqui Public Model License (CPML). The broader TTS library is available under MPL-2.0.

FAQ

What is the minimum audio length required for voice cloning?

A 6-second audio clip is sufficient.

How many languages does XTTS V2 support?

It supports 17 languages: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Hungarian, Korean, and Hindi.

What is the output sampling rate?

24 kHz.

Can I use the model for streaming?

Yes, XTTS V2 can stream with less than 200 ms latency.

How do I access the model via API?

XTTS V2 is not yet live on gigarouter. Once it is hosted, it will be accessible through the gigarouter OpenAI-compatible endpoint with an API key.
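Since the model is not yet hosted, no endpoint details are published here. As a rough sketch of what a request to an OpenAI-compatible speech endpoint tends to look like, the code below builds (but does not send) a `POST` to `/audio/speech`. The base URL is a placeholder, and the `voice` value and payload fields are assumptions modeled on the OpenAI speech API, not confirmed gigarouter parameters.

```python
import json
import urllib.request

# Placeholder base URL: NOT a real gigarouter address.
BASE_URL = "https://api.example.invalid/v1"


def build_speech_request(api_key: str, text: str, voice: str = "Ana Florence") -> urllib.request.Request:
    """Build an OpenAI-style speech request without sending it."""
    payload = {
        "model": "coqui/XTTS-v2",  # model id as shown on this page
        "input": text,
        "voice": voice,  # assumed to map to a built-in speaker
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_speech_request("YOUR_API_KEY", "Hello from XTTS V2.")
print(request.get_method(), request.full_url)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen` and writing the response bytes to an audio file, once a real base URL and key are available.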

not yet live

We're benchmarking and onboarding XTTS V2 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
