F5 TTS
SWivid/F5-TTS
published Oct 2024 · updated Mar 2025
A popular open text-to-speech model, with 799.1K downloads a month. gigarouter benchmarks and hosts it as an OpenAI-compatible API.
specs
| Spec | Value |
|---|---|
| Task | Text-to-Speech |
about this model
F5-TTS is a fully non-autoregressive text-to-speech model that generates fluent and faithful speech using flow matching with a Diffusion Transformer (DiT). It simplifies the TTS pipeline by eliminating the need for a separate duration model, text encoder, or phoneme alignment: text input is padded with filler tokens to match the speech length, then denoised directly. The model refines text representations with ConvNeXt V2 blocks to improve alignment with speech, and introduces Sway Sampling, an inference-time flow step strategy that boosts performance and efficiency without retraining.
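The two ideas in this paragraph, filler-token padding and Sway Sampling, can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's implementation. The filler symbol `<F>` and both function names are hypothetical. The sway function is assumed to have the form `u + s * (cos(pi/2 * u) - 1 + u)` with a default coefficient `s = -1`, as recalled from the F5-TTS paper; check it against the paper before relying on it.

```python
import math

FILLER = "<F>"  # hypothetical filler-token symbol, chosen for illustration


def pad_text_to_speech_length(text: str, n_frames: int) -> list[str]:
    """Pad a character sequence with filler tokens up to the number of speech frames."""
    chars = list(text)
    if len(chars) > n_frames:
        raise ValueError("text is longer than the target speech length")
    return chars + [FILLER] * (n_frames - len(chars))


def sway_sampling_steps(n_steps: int, s: float = -1.0) -> list[float]:
    """Warp uniformly spaced flow steps u in [0, 1].

    Assumed form: u + s * (cos(pi/2 * u) - 1 + u).
    With s = 0 the schedule stays uniform.
    """
    uniform = [i / n_steps for i in range(n_steps + 1)]
    return [u + s * (math.cos(math.pi / 2 * u) - 1 + u) for u in uniform]


# Two characters padded out to five speech frames.
padded = pad_text_to_speech_length("hi", 5)

# Five step boundaries (four intervals) under the assumed default coefficient.
steps = sway_sampling_steps(4)
```

Under this assumed form, `s = -1` reduces to `1 - cos(pi/2 * u)`, which maps the midpoint `u = 0.5` to roughly 0.29. More of the step budget therefore lands early in the flow, and only the inference schedule changes, so no retraining is involved.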
Performance and training
F5-TTS was trained on the public Emilia Dataset, a multilingual corpus of 100K hours. It achieves an inference real-time factor (RTF) of 0.15, substantially faster than state-of-the-art diffusion-based TTS models. The model demonstrates strong zero-shot voice cloning, seamless code-switching between languages, and controllable speaking speed.
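To put the RTF figure in concrete terms: the real-time factor is the time spent synthesizing divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below uses a hypothetical helper name and illustrative timings (1.5 seconds of compute for 10 seconds of audio), not measurements from the source.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by generated audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


# Illustrative numbers only: 1.5 s of compute for a 10 s clip gives an RTF of 0.15.
rtf = real_time_factor(1.5, 10.0)
```

At an RTF of 0.15, a one-minute clip would take about nine seconds to generate on comparable hardware.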
Architectural variants
The same repository includes two architectures:
| Model | Backbone | Key attribute |
|---|---|---|
| F5-TTS | Diffusion Transformer + ConvNeXt V2 | Faster training and inference |
| E2 TTS | Flat-UNet Transformer | Closest reproduction of the E2 TTS paper |
Both variants are available as checkpoints. The F5-TTS v1 base model offers improved training and inference performance over the initial release.
License
This model is released under the CC-BY-NC-4.0 license, permitting non-commercial use with attribution.
FAQ
We're benchmarking and onboarding F5 TTS as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.