SeamlessM4T Medium
facebook/hf-seamless-m4t-medium
published Aug 2023 · updated Dec 2023
SeamlessM4T Medium is a unified multilingual translation model supporting text-to-speech, speech-to-speech, speech-to-text, and text-to-text tasks across up to 196 languages.
specs
| Task | Text-to-Speech Translation / Multilingual Translation |
| Architecture | SeamlessM4T (encoder-decoder with w2v-BERT 2.0) |
| Parameters | 1.2B |
| License | CC-BY-NC 4.0 |
about this model
hf-seamless-m4t-medium is the Medium checkpoint of SeamlessM4T, a unified multimodal translation model developed by Meta. It generates natural speech from text in 35 output languages and also supports speech-to-speech, speech-to-text, and text-to-text translation. Because translation and speech synthesis happen inside one model, text-to-speech translation does not require chaining separate ASR, translation, or TTS systems. gigarouter is onboarding the model as a managed API.
Capabilities
- Accepts text input in 196 languages and produces speech output in 35 languages (with dedicated speaker identities).
- Supports simultaneous text and speech generation (using `return_intermediate_token_ids=True`).
- Provides a single model for five tasks (S2ST, S2TT, T2ST, T2TT, and ASR), all accessible via the same API endpoint.
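As a rough illustration of the capabilities above, the following sketch runs text-to-speech translation locally with the Hugging Face Transformers classes for this checkpoint, rather than through the hosted endpoint. It assumes `transformers` and `torch` are installed and that the checkpoint can be downloaded; the `TASKS` table and both function names are illustrative helpers, not part of any library.

```python
# Input and output modality of each task named in the list above.
TASKS = {
    "S2ST": ("speech", "speech"),
    "S2TT": ("speech", "text"),
    "T2ST": ("text", "speech"),
    "T2TT": ("text", "text"),
    "ASR": ("speech", "text"),
}

CHECKPOINT = "facebook/hf-seamless-m4t-medium"


def wants_speech(task: str) -> bool:
    """Return True when the task's output modality is speech."""
    return TASKS[task][1] == "speech"


def text_to_speech_translation(text: str, src_lang: str = "eng", tgt_lang: str = "fra"):
    """Translate `text` and return (waveform, translated_text, sampling_rate).

    The heavy import is kept local so that importing this module stays cheap;
    the checkpoint is downloaded on first use.
    """
    from transformers import AutoProcessor, SeamlessM4TModel

    processor = AutoProcessor.from_pretrained(CHECKPOINT)
    model = SeamlessM4TModel.from_pretrained(CHECKPOINT)

    inputs = processor(text=text, src_lang=src_lang, return_tensors="pt")

    # return_intermediate_token_ids=True makes generate() return the
    # translated text tokens alongside the synthesized waveform.
    output = model.generate(
        **inputs,
        tgt_lang=tgt_lang,
        generate_speech=wants_speech("T2ST"),
        return_intermediate_token_ids=True,
    )

    waveform = output.waveform[0].cpu().numpy().squeeze()
    translated_text = processor.decode(
        output.sequences[0].tolist(), skip_special_tokens=True
    )
    return waveform, translated_text, model.config.sampling_rate
```

In Transformers, passing `generate_speech=False` to the same `generate` call returns text tokens only, which is how the text-output tasks (S2TT, T2TT, ASR) are selected.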
Performance
On the FLEURS benchmark, the SeamlessM4T paper reports a 20% BLEU improvement over the previous state of the art in direct speech-to-text translation. Compared with strong cascaded systems, it improves into-English translation by 1.3 BLEU points in speech-to-text and 2.6 ASR-BLEU points in speech-to-speech. The speech representations were learned with w2v-BERT 2.0 self-supervised training on 1 million hours of open speech audio.
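The figures above mix a percentage gain (20% BLEU) with gains stated in points (1.3 and 2.6). Assuming the percentage is relative to the baseline score, which is the usual reading but is not spelled out here, the small sketch below shows how the two kinds of gain differ. The baseline score of 20.0 is a placeholder chosen for easy arithmetic, not a result from the paper.

```python
def apply_relative_gain(baseline: float, percent: float) -> float:
    """Score after a gain expressed as a percentage of the baseline."""
    return baseline * (1.0 + percent / 100.0)


def apply_absolute_gain(baseline: float, points: float) -> float:
    """Score after a gain expressed in metric points."""
    return baseline + points


def relative_gain(baseline: float, improved: float) -> float:
    """Gain of `improved` over `baseline`, as a percentage of the baseline."""
    return (improved - baseline) / baseline * 100.0


# Placeholder baseline, for illustration only (not taken from the paper).
placeholder_baseline = 20.0

after_relative = apply_relative_gain(placeholder_baseline, 20.0)  # 20% relative gain
after_absolute = apply_absolute_gain(placeholder_baseline, 1.3)   # +1.3 points
```

With this placeholder baseline, a 20% relative gain is worth 4 points, about three times the 1.3-point absolute gain, so the two kinds of figure are not directly comparable.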
Additional Details
- Model size: 1.2 billion parameters.
- License: CC-BY-NC 4.0 (non-commercial).
- Safety evaluations include gender bias and added toxicity metrics, reported in the SeamlessM4T paper.
- Full evaluation metrics (BLEU, WER, chrF) are available for download.
best for
- Text-to-speech translation across 35 output languages
- Speech-to-speech translation for real-time interpretation
- Multilingual content creation with automatic speech recognition
FAQ
SeamlessM4T Medium supports speech-to-speech translation (S2ST), speech-to-text translation (S2TT), text-to-speech translation (T2ST), text-to-text translation (T2TT), and automatic speech recognition (ASR) from a single model.
The model covers 101 languages for speech input, 196 languages for text input/output, and 35 languages for speech output.
Use the gigarouter OpenAI-compatible endpoint with an API key. Send requests with the required input (text or audio) and target language; the API returns translated text or speech.
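A request to such an endpoint might be assembled as in the sketch below. Nothing here is confirmed by this page beyond "OpenAI-compatible" and "API key": the base URL is a placeholder, the `/audio/speech` path follows the OpenAI speech API convention, and the `target_language` field is a hypothetical name for the target-language parameter. The function only builds the request and does not send it.

```python
import json
import urllib.request

# Placeholder base URL: the real gigarouter endpoint is not given on this page.
DEFAULT_BASE_URL = "https://api.example.invalid/v1"


def build_speech_request(
    api_key: str,
    text: str,
    target_language: str,
    base_url: str = DEFAULT_BASE_URL,
) -> urllib.request.Request:
    """Build (but do not send) a text-to-speech translation request."""
    payload = {
        "model": "facebook/hf-seamless-m4t-medium",
        "input": text,
        # Hypothetical field name; check the service documentation.
        "target_language": target_language,
    }
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",  # assumed OpenAI-style path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the response format (audio bytes or JSON) depends on the service and is not specified here.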
The model is released under the CC-BY-NC 4.0 license, which allows non-commercial use with attribution.
On FLEURS, SeamlessM4T achieves a 20% BLEU improvement over prior SOTA in direct speech-to-text translation and improves into-English translation by 1.3 BLEU points in speech-to-text and 2.6 ASR-BLEU points in speech-to-speech compared to strong cascaded models.
We're benchmarking and onboarding SeamlessM4T Medium as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.