MMS-TTS English
facebook/mms-tts-eng
published Aug 2023 · updated Sep 2023
MMS-TTS English is a text-to-speech model that synthesizes English speech from text using a VITS architecture.
specs
| Task | Text-to-Speech |
| Architecture | VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) |
| Language | English |
| License | CC-BY-NC 4.0 |
about this model
facebook/mms-tts-eng is a text-to-speech (TTS) model that generates speech waveforms from English text input. It is a single-language checkpoint from the Massively Multilingual Speech (MMS) project, which scales speech technology to over 1,400 languages using self-supervised learning and a dataset derived from public religious texts.
Model Architecture
The model uses VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), a conditional variational autoencoder that generates speech waveforms directly from text. It comprises a posterior encoder, a conditional prior, and a decoder. The prior is a flow-based module built from a Transformer text encoder and multiple coupling layers, and the decoder is a stack of transposed convolutions in the style of the HiFi-GAN vocoder. A stochastic duration predictor lets the model produce different speech rhythms from the same text, which addresses the one-to-many nature of TTS: a single sentence can be spoken in many valid ways. The system is trained end-to-end with a combination of variational lower-bound and adversarial losses, and normalizing flows are applied to the conditional prior distribution to improve expressiveness.
Key Strengths
- Non-deterministic output: due to the stochastic duration predictor, the model can generate diverse prosody and timing; a fixed seed is required to reproduce the same waveform.
- Part of the MMS project, which provides TTS checkpoints for hundreds of languages, enabling cross-lingual applications.
- End-to-end generation eliminates the need for separate acoustic feature extraction or vocoder pipelines.
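The seed behaviour described in the first bullet can be sketched with the VITS classes in Hugging Face Transformers. This is a minimal sketch, assuming transformers (4.33 or later) and torch are installed and the checkpoint can be downloaded. The helper name synthesize and the seed values are illustrative and not part of the model or of any gigarouter API.

```python
def synthesize(model, tokenizer, text, seed):
    """Return a 1-D waveform tensor for `text`, using a fixed random seed."""
    # Heavy dependencies are imported lazily so the helper can be defined
    # without loading torch up front.
    import torch
    from transformers import set_seed

    # Seeds the Python, NumPy and torch RNGs. The stochastic duration
    # predictor samples noise through torch, so this pins rhythm and timing.
    set_seed(seed)

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform  # shape: (batch, num_samples)
    return waveform[0]


def load_mms_tts_eng():
    """Load the tokenizer and model for this checkpoint from the Hugging Face Hub."""
    from transformers import AutoTokenizer, VitsModel

    model_id = "facebook/mms-tts-eng"
    return VitsModel.from_pretrained(model_id), AutoTokenizer.from_pretrained(model_id)
```

Under these assumptions, calling load_mms_tts_eng() once and then synthesize(model, tokenizer, "Hello from MMS.", seed=555) twice on the same machine should yield matching waveforms, while a different seed will typically change the timing and therefore the number of samples.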
The model is licensed under CC-BY-NC 4.0. Served as a hosted API on gigarouter, it can be called directly, with no dependencies or inference code to manage.
best for
- Generating English speech from text for voice applications
- Creating speech for multilingual TTS systems as the English component
- Synthesizing speech with variable rhythm from the same text input
FAQ
The input is a text string, tokenized with the tokenizer that ships with the checkpoint (for example, loaded through AutoTokenizer in Hugging Face Transformers).
The model outputs a waveform tensor representing the synthesized speech, which can be saved as a .wav file.
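Saving a waveform as a .wav file can be sketched with the Python standard library alone. The example below assumes the samples are floats in the range [-1.0, 1.0] and a 16 kHz sampling rate. Both are assumptions for illustration: when running the checkpoint through Transformers, the rate should be read from model.config.sampling_rate. The helper name save_wav and the sine tone standing in for model output are hypothetical.

```python
import math
import struct
import wave


def save_wav(path, samples, sample_rate=16000):
    """Write float samples in [-1.0, 1.0] to a mono 16-bit PCM .wav file."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip out-of-range values
        frames += struct.pack("<h", int(round(s * 32767)))  # little-endian int16
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in for model output: 0.5 s of a 440 Hz tone at 16 kHz.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(8000)]
save_wav("example.wav", tone)
```

With a real waveform tensor, something like waveform.squeeze().tolist() would produce the flat list of floats this helper expects.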
To call the model, use the gigarouter OpenAI-compatible endpoint with your API key and pass the text to synthesize in the request.
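As a rough sketch of what such a call could look like, the example below builds (without sending) a request in the shape of the OpenAI speech API: a POST to an audio/speech path with a JSON body containing model and input. Everything specific here is an assumption: the base URL is a placeholder, and the path, the body fields, and whether further fields such as a voice are required all depend on what gigarouter actually documents.

```python
import json
import urllib.request

# Hypothetical placeholder; substitute the base URL gigarouter documents.
BASE_URL = "https://api.gigarouter.example/v1"


def build_speech_request(text, api_key, model="facebook/mms-tts-eng"):
    """Build (but do not send) a POST request in the OpenAI speech-API shape."""
    payload = {"model": model, "input": text}  # assumed field names
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",  # assumed path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_speech_request("Hello from MMS.", api_key="YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

If the endpoint matches these assumptions, passing the request to urllib.request.urlopen would return the audio bytes in the response body.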
The model is licensed under CC-BY-NC 4.0, which permits non-commercial use only.
We're benchmarking and onboarding MMS-TTS English as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.