models / text-to-speech · coming soon

MMS-TTS English

facebook/mms-tts-eng

published Aug 2023 · updated Sep 2023

MMS-TTS English is a text-to-speech model that synthesizes English speech from text using a VITS architecture.

est. price: ~$0.0075 (estimated, set at launch)
API providers: 0
downloads / mo: 137K
license: cc-by-nc-4.0

specs

Task: Text-to-Speech
Architecture: VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)
Language: English
License: CC-BY-NC 4.0

about this model

facebook/mms-tts-eng is a text-to-speech (TTS) model that generates speech waveforms from English text input. It is a single-language checkpoint from the Massively Multilingual Speech (MMS) project, which scales speech technology to over 1,400 languages using self-supervised learning and a dataset derived from recordings of publicly available religious texts.

Model Architecture

The model uses VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), a conditional variational autoencoder that predicts speech waveforms directly from text. It comprises a posterior encoder, a flow-based module with a Transformer text encoder and multiple coupling layers, and a HiFi-GAN-style decoder. TTS is a one-to-many problem, since the same text can be spoken with many valid rhythms, so a stochastic duration predictor lets the model produce varied timing from the same input. The system is trained end-to-end with a combination of a variational lower bound and adversarial losses, and normalizing flows are applied to the conditional prior distribution to make it more expressive.
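The training objective mentioned above can be sketched as follows. This is our reading of the formulation in the VITS paper, not something stated on this page: x is the target speech, z the latent variable, c the conditioning (text and alignment), and G the generator; the individual loss names follow the paper's notation.

```latex
% Variational lower bound on the conditional log-likelihood
\log p_\theta(x \mid c) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\!\left[
    \log p_\theta(x \mid z)
    - \log \frac{q_\phi(z \mid x)}{p_\theta(z \mid c)}
  \right]

% Total generator-side training loss: reconstruction, KL, duration,
% adversarial, and feature-matching terms
L_{vae} = L_{recon} + L_{kl} + L_{dur} + L_{adv}(G) + L_{fm}(G)
```

The first two terms of the total loss correspond to the negative lower bound; the adversarial and feature-matching terms come from the discriminator used during training.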

Key Strengths

  • Non-deterministic output: due to the stochastic duration predictor, the model can generate diverse prosody and timing; a fixed seed is required to reproduce the same waveform.
  • Part of the MMS project, which provides a separate TTS checkpoint per language for more than 1,100 languages, so the same pipeline can target another language by swapping the checkpoint.
  • End-to-end generation eliminates the need for separate acoustic feature extraction or vocoder pipelines.

The model is licensed under CC-BY-NC 4.0. It is not yet live on gigarouter; once it is available as a hosted API, you will be able to call it directly without managing dependencies or inference code.

FAQ

What is the input format for MMS-TTS English?

The input is a plain English text string. When running the checkpoint with Hugging Face transformers, load the model's tokenizer through AutoTokenizer and pass the tokenized text to the model.

What output does the model produce?

It outputs a waveform tensor of synthesized speech, which can be saved as a .wav file at the model's sampling rate (16 kHz for MMS-TTS checkpoints).
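The input and output described in the two answers above can be sketched locally with Hugging Face transformers. This is a minimal sketch, not gigarouter's hosted pipeline: it assumes transformers (with VITS support) and torch are installed, and the helper names synthesize and save_wav are our own. The seed is fixed because the duration predictor is stochastic.

```python
import array
import sys
import wave


def synthesize(text, seed=0, model_id="facebook/mms-tts-eng"):
    """Return (samples, sampling_rate) for `text` (hypothetical helper)."""
    # Imported lazily so save_wav below can be used without torch installed.
    import torch
    from transformers import AutoTokenizer, VitsModel, set_seed

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = VitsModel.from_pretrained(model_id)
    inputs = tokenizer(text, return_tensors="pt")

    set_seed(seed)  # same seed + same text should reproduce the waveform
    with torch.no_grad():
        waveform = model(**inputs).waveform  # shape: (batch, num_samples)
    return waveform[0].tolist(), model.config.sampling_rate


def save_wav(samples, sampling_rate, path):
    """Write float samples in [-1.0, 1.0] as 16-bit mono PCM (hypothetical helper)."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV stores samples little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(pcm.tobytes())


# Example (downloads the checkpoint on first use):
# samples, rate = synthesize("Hello from MMS-TTS English.")
# save_wav(samples, rate, "hello.wav")
```

The WAV writer uses only the standard library so the saving step does not depend on scipy or soundfile; either of those would work equally well.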

How do I call this model via the gigarouter API?

The model is not yet live. Once it launches, use the gigarouter OpenAI-compatible endpoint with your API key, passing the text input as required by the endpoint.
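Because the model is not live, the exact request shape is not documented here. The sketch below only illustrates what an OpenAI-compatible text-to-speech call usually looks like. It assumes the endpoint mirrors OpenAI's /audio/speech route with model and input fields; the base URL, the model identifier accepted by gigarouter, and the optional voice field are all assumptions, and build_speech_request is a hypothetical helper. It builds the request without sending it.

```python
import json
import urllib.request


def build_speech_request(base_url, api_key, text,
                         model="facebook/mms-tts-eng", voice=None):
    """Build (but do not send) an OpenAI-style speech request.

    base_url is a placeholder you must supply; the route and field
    names are assumed from OpenAI's API, not from gigarouter docs.
    """
    payload = {"model": model, "input": text}
    if voice is not None:
        payload["voice"] = voice  # assumed optional for a single-speaker model
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Sending it would look like this once a real base URL is known:
# with urllib.request.urlopen(request) as response:
#     audio_bytes = response.read()
```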

What license does this model use and can I use it commercially?

It is licensed under CC-BY-NC 4.0, which permits non-commercial use only and requires attribution. Commercial use is not permitted.

not yet live

We're benchmarking and onboarding MMS-TTS English as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
