E2 TTS

SWivid/E2-TTS

published Oct 2024 · updated Mar 2025

E2 TTS is a fully non-autoregressive zero-shot text-to-speech model using flow matching.

status: coming soon
API providers: 0
downloads / mo: 108.8K
license: cc-by-nc-4.0

specs

Task: Text-to-Speech
Architecture: Flat-UNet Transformer
License: CC BY-NC 4.0

about this model

E2 TTS (Embarrassingly Easy Text-to-Speech) is a fully non-autoregressive, zero-shot text-to-speech model that converts text into speech using a flow-matching-based mel spectrogram generator trained on an audio infilling task. The input text is represented as a character sequence with filler tokens, eliminating the need for additional components such as a duration model, grapheme-to-phoneme conversion, or monotonic alignment search.
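The filler-token input described above can be illustrated with a short sketch. It assumes the character sequence is simply padded with a filler symbol until it matches the number of mel spectrogram frames; the symbol `<F>`, the function name, and the frame count are hypothetical choices for illustration, not taken from the model's code.

```python
FILLER = "<F>"  # hypothetical filler symbol, chosen for illustration


def pad_with_filler(text: str, num_frames: int, filler: str = FILLER) -> list[str]:
    """Extend a character sequence with filler tokens up to num_frames entries."""
    chars = list(text)
    if len(chars) > num_frames:
        raise ValueError("text has more characters than there are mel frames")
    # No duration model or alignment: the text is only padded to the target length.
    return chars + [filler] * (num_frames - len(chars))


# Example: 8 characters padded to an assumed 12 mel frames.
padded = pad_with_filler("hi there", 12)
print(padded)
```

Because the padded sequence has the same length as the mel spectrogram, the model can learn where each character falls in time on its own, which is why no separate aligner is required.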

The model uses a Flat-UNet Transformer architecture and gains its zero-shot capability from flow-matching training on large-scale speech data. According to the original paper, E2 TTS reaches human-level naturalness with state-of-the-art speaker similarity and intelligibility, and its zero-shot performance is comparable to or better than that of previous systems, including Voicebox and NaturalSpeech 3.
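For readers unfamiliar with flow matching, the general conditional flow-matching objective with an optimal-transport path can be written as below. This is the standard textbook form, shown as a sketch; the symbols (noise sample x_0, data sample x_1, conditioning c, vector field v_theta, and sigma_min) are notation introduced here, and the exact conditioning and masking used by E2 TTS may differ.

```latex
% Interpolation between noise x_0 ~ N(0, I) and a data sample x_1, for t in [0, 1]
x_t = \bigl(1 - (1 - \sigma_{\min})\, t\bigr)\, x_0 + t\, x_1

% The network v_\theta regresses the velocity of that path, given conditioning c
\mathcal{L}_{\mathrm{CFM}}(\theta)
  = \mathbb{E}_{t,\, x_0,\, x_1}
    \bigl\| v_\theta(x_t, t, c) - \bigl(x_1 - (1 - \sigma_{\min})\, x_0\bigr) \bigr\|^2
```

In an audio infilling setup, c would carry the text and the unmasked part of the speech, and the loss would typically be computed only over the masked region.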

E2 TTS is released under the CC BY-NC 4.0 license. On the Hugging Face hub, the model has received over 799,000 monthly downloads and is used in 461 Spaces, with 131 fine-tuned variants and 4 quantizations available. gigarouter is onboarding the model as a managed, OpenAI-compatible API, which will let developers integrate it without managing infrastructure. It is not yet live.

For further details, see the paper, the GitHub repository, and demo samples at https://aka.ms/e2tts/.

FAQ

What is E2 TTS?

E2 TTS is a fully non-autoregressive zero-shot text-to-speech model that uses flow matching to generate mel spectrograms from text with filler tokens, achieving human-level naturalness.

What input format does E2 TTS expect?

The model takes the text to synthesize plus an audio prompt that supplies the voice to clone. It generates a mel spectrogram, which a vocoder then converts to a waveform.

How can I use E2 TTS via the gigarouter API?

E2 TTS is not yet live on gigarouter. Once it is, you will be able to send text-to-speech requests to the gigarouter OpenAI-compatible endpoint, authenticated with an API key.
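A minimal sketch of what such a request might look like, assuming the endpoint follows the OpenAI speech route (`/audio/speech` with `model` and `input` fields). The base URL, the model identifier, and the payload fields are assumptions for illustration; the real values will come from the gigarouter documentation once the model is live. The sketch only builds the request and does not send it.

```python
import json
import urllib.request

BASE_URL = "https://api.gigarouter.example/v1"  # hypothetical base URL
MODEL_ID = "SWivid/E2-TTS"  # repository id from this page; the API model id is an assumption


def build_speech_request(text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {"model": MODEL_ID, "input": text}
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_speech_request("Hello from E2 TTS.", api_key="YOUR_API_KEY")
print(req.get_method(), req.full_url)

# Sending is left out because the model is not yet live. Once it is:
# with urllib.request.urlopen(req) as resp:
#     audio_bytes = resp.read()
```

How the audio prompt for voice cloning would be passed is not specified on this page, so the sketch omits it.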

Is E2 TTS free to use?

For non-commercial use, yes. The model is licensed under CC BY-NC 4.0, which allows non-commercial use with attribution. Commercial use is not permitted under this license.

How does E2 TTS compare to other zero-shot TTS models?

According to the original paper, E2 TTS matches or surpasses Voicebox and NaturalSpeech 3 in zero-shot speaker similarity and intelligibility, while being simpler and fully non-autoregressive.

not yet live

We're benchmarking and onboarding E2 TTS as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
