F5 TTS

SWivid/F5-TTS

published Oct 2024 · updated Mar 2025

A popular open text-to-speech model, with 799.1K downloads a month. gigarouter is benchmarking it and plans to host it as an OpenAI-compatible API.

status: coming soon

downloads / mo: 799.1K

license: cc-by-nc-4.0

task: Text-to-Speech

about this model

F5-TTS is a fully non-autoregressive text-to-speech model that generates fluent and faithful speech using flow matching with a Diffusion Transformer (DiT). It simplifies the TTS pipeline by eliminating the need for a separate duration model, text encoder, or phoneme alignment: text input is padded with filler tokens to match the speech length, then denoised directly. The model refines text representations with ConvNeXt V2 blocks to improve alignment with speech, and introduces Sway Sampling, an inference-time flow step strategy that boosts performance and efficiency without retraining.
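
The two ideas above, filler-token padding and Sway Sampling, can be sketched in a few lines of Python. This is an illustrative sketch, not the repository's code: the filler symbol `<F>` and all function names are hypothetical, and the sway formula is not stated on this page. It follows the form given in the F5-TTS paper, t = u + s * (cos(pi/2 * u) - 1 + u), so treat it as an assumption and check it against the paper before relying on it.

```python
import math

FILLER = "<F>"  # hypothetical filler symbol; the real token is an implementation detail


def pad_text_to_frames(text: str, n_frames: int) -> list[str]:
    """Pad a character sequence with filler tokens up to the speech length."""
    chars = list(text)
    if len(chars) > n_frames:
        raise ValueError("text is longer than the target number of frames")
    return chars + [FILLER] * (n_frames - len(chars))


def sway(u: float, s: float = -1.0) -> float:
    """Map a uniform flow step u in [0, 1] to a warped step (assumed formula).

    A negative s moves steps toward t = 0, the early part of the flow.
    """
    return u + s * (math.cos(math.pi / 2 * u) - 1 + u)


def sway_schedule(n_steps: int, s: float = -1.0) -> list[float]:
    """Return n_steps + 1 warped time points running from 0 to 1."""
    return [sway(i / n_steps, s) for i in range(n_steps + 1)]
```

For example, `pad_text_to_frames("hi", 5)` returns `["h", "i", "<F>", "<F>", "<F>"]`. With `s = -1.0`, `sway_schedule` keeps the endpoints at 0 and 1 but spaces the early steps more closely than the late ones, which is an inference-time change only and needs no retraining.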

Performance and training

F5-TTS was trained on the public Emilia Dataset, a multilingual corpus of 100K hours. It achieves an inference real-time factor (RTF) of 0.15, substantially faster than state-of-the-art diffusion-based TTS models. The model demonstrates strong zero-shot voice cloning, seamless code-switching between languages, and controllable speaking speed.
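
To make the RTF figure concrete: real-time factor is synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The helper names below are hypothetical, and the estimate assumes the quoted RTF of 0.15 holds on your hardware, which this page does not specify.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def estimated_synthesis_seconds(audio_seconds: float, rtf: float = 0.15) -> float:
    """Rough compute time for a clip, assuming a constant RTF."""
    return audio_seconds * rtf
```

Under that assumption, a 10-second clip would take about 1.5 seconds to generate.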

Architectural variants

The same repository includes two architectures:

Model	Backbone	Key attribute
F5-TTS	Diffusion Transformer + ConvNeXt V2	Faster training and inference
E2 TTS	Flat-UNet Transformer	Closest reproduction of the E2 TTS paper

Both variants are available as checkpoints. The F5-TTS v1 base model offers improved training and inference performance over the initial release.

License

This model is released under the CC-BY-NC-4.0 license, permitting non-commercial use with attribution.
