
MMS-TTS English

facebook/mms-tts-eng

published Aug 2023 · updated Sep 2023

MMS-TTS English is a text-to-speech model that synthesizes English speech from text using a VITS architecture.

est. price	~$0.0075 (estimated, set at launch)
downloads / mo	137K
license	cc-by-nc-4.0

specs

Task	Text-to-Speech
Architecture	VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)
Language	English
License	CC-BY-NC 4.0

about this model

facebook/mms-tts-eng is a text-to-speech (TTS) model that generates speech waveforms from English text input. It is a single-language checkpoint from the Massively Multilingual Speech (MMS) project, which scales speech technology to over 1,000 languages using self-supervised learning and a dataset derived from readings of publicly available religious texts.

Model Architecture

The model uses VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), a conditional variational autoencoder that predicts speech waveforms directly. It comprises a posterior encoder, a flow-based module with a Transformer text encoder and multiple coupling layers, and a HiFi-GAN-style decoder. A stochastic duration predictor allows the model to produce varied speech rhythms from the same text, addressing the one-to-many nature of TTS. The system is trained end-to-end with a combination of variational lower bound and adversarial losses, and normalizing flows are applied to the conditional prior distribution to improve expressiveness.

Key Strengths

Non-deterministic output: due to the stochastic duration predictor, the model can generate diverse prosody and timing; a fixed seed is required to reproduce the same waveform.
Part of the MMS project, which provides separate TTS checkpoints for over 1,100 languages; a multilingual application can reuse the same pipeline and load one checkpoint per language.
End-to-end generation eliminates the need for separate acoustic feature extraction or vocoder pipelines.
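The reproducibility point above can be illustrated with a minimal local-inference sketch. It assumes `torch` and a version of Hugging Face `transformers` that includes `VitsModel` are installed, and that the checkpoint can be downloaded from the Hub. The function name `synthesize` and the seed value are illustrative choices, not part of the model.

```python
MODEL_ID = "facebook/mms-tts-eng"


def synthesize(text, seed=555):
    """Return (waveform, sampling_rate) for `text`.

    Reusing the same seed should reproduce the same waveform in the same
    environment; the seed value itself is arbitrary.
    """
    # Heavy dependencies are imported lazily so importing this module stays cheap.
    import torch
    from transformers import AutoTokenizer, VitsModel, set_seed

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = VitsModel.from_pretrained(MODEL_ID)

    # The tokenizer turns the raw string into model input tensors.
    inputs = tokenizer(text, return_tensors="pt")

    # Fix the random state: the stochastic duration predictor samples noise
    # at inference time, so unseeded runs differ in rhythm and timing.
    set_seed(seed)
    with torch.no_grad():
        waveform = model(**inputs).waveform  # shape: (batch, num_samples)

    return waveform[0], model.config.sampling_rate
```

Calling `synthesize("Hello world", seed=1)` twice should give matching waveforms, while changing the seed should change the timing of the result.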

The model is licensed under CC-BY-NC 4.0. Once it is live as a hosted API on gigarouter, you will be able to call it directly without managing dependencies or inference code.

best for

· Generating English speech from text for voice applications
· Serving as the English component of a multilingual TTS system
· Synthesizing speech with variable rhythm from the same text input

FAQ

What is the input format for MMS-TTS English?

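The input is a plain text string. When you run the checkpoint yourself, the string is tokenized with the tokenizer that ships with the model, which can be loaded through AutoTokenizer in Hugging Face Transformers.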

What output does the model produce?

It outputs a waveform tensor representing the synthesized speech, which can be saved as a .wav file.
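As a rough illustration of the saving step, the sketch below writes a mono waveform to a 16-bit PCM .wav file using only the Python standard library. It assumes the samples are floats in the range [-1.0, 1.0]; the helper name `save_wav` is hypothetical and not part of any library.

```python
import struct
import wave


def save_wav(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        # Clip out-of-range values, then scale to the signed 16-bit range.
        clipped = max(-1.0, min(1.0, float(s)))
        frames += struct.pack("<h", int(clipped * 32767))

    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
```

With a PyTorch waveform tensor from a local run, a call might look like `save_wav("out.wav", waveform.tolist(), model.config.sampling_rate)`, where `model.config.sampling_rate` is the rate stored in the checkpoint's configuration.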

How do I call this model via the gigarouter API?

Once the model is live, send requests to the gigarouter OpenAI-compatible endpoint, authenticating with your API key and passing the text to synthesize in the request body.
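Because the endpoint is not yet published, the following is only a sketch of what a request might look like if it follows the shape of OpenAI's `/audio/speech` route. The base URL is a placeholder, and the model identifier and request fields are assumptions; OpenAI's own route also takes a `voice` field, and whether gigarouter requires it is unknown. The request is built but not sent.

```python
import json
import urllib.request

# Hypothetical placeholder: the real gigarouter base URL is not given on this page.
BASE_URL = "https://api.example.invalid/v1"


def build_speech_request(text, api_key, base_url=BASE_URL):
    """Build (but do not send) a POST request shaped like OpenAI's /audio/speech."""
    # Assumed fields: "model" and "input", as in OpenAI's speech API.
    body = {"model": "facebook/mms-tts-eng", "input": text}
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; check the final gigarouter documentation for the actual URL, model name, and required fields before doing so.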

What license does this model use and can I use it commercially?

It is licensed under CC-BY-NC 4.0, which permits non-commercial use with attribution. Commercial use is not permitted under this license.

not yet live

We're benchmarking and onboarding MMS-TTS English as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
