SpeechT5 TTS
microsoft/speecht5_tts
published Feb 2023 · updated Nov 2023
SpeechT5 TTS is a text-to-speech model fine-tuned on LibriTTS for speech synthesis.
specs
| Task | Text-to-Speech |
| Architecture | Shared encoder-decoder backbone with modal-specific pre-nets and post-nets (SpeechT5) |
| License | MIT |
| Training Data | LibriTTS (fine-tuning) |
about this model
microsoft/speecht5_tts is a text-to-speech model built on the SpeechT5 unified-modal encoder-decoder framework, fine-tuned for speech synthesis on the LibriTTS dataset. The model was developed by Microsoft; the SpeechT5 paper was published at ACL 2022.
Architecture
SpeechT5 uses a shared encoder-decoder network with six modal-specific pre-nets and post-nets for speech and text. The text encoder pre-net maps input tokens to hidden representations (similar to BERT), while the speech decoder employs a Tacotron2-inspired pre-net (linear layers for log mel spectrograms) and a post-net (residual refinement of spectrograms). A cross-modal vector quantization mechanism aligns speech and text representations in a unified semantic space.
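As a rough illustration of the vector quantization idea described above, the toy sketch below maps hidden states to their nearest entry in a shared codebook and randomly swaps some states for those entries. It is not the SpeechT5 implementation: the codebook values, the two-dimensional states, and the names `nearest_code` and `mix_with_codes` are all assumptions made for the example.

```python
import random


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def mix_with_codes(states, codebook, replace_prob, rng):
    """Replace each state with its nearest code with probability `replace_prob`."""
    mixed = []
    for state in states:
        if rng.random() < replace_prob:
            mixed.append(list(codebook[nearest_code(state, codebook)]))
        else:
            mixed.append(list(state))
    return mixed


# Toy values, chosen only for illustration (not taken from the model).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
speech_state = [0.9, 1.2]  # stand-in for an encoder output on a speech frame
text_state = [0.8, 0.9]    # stand-in for an encoder output on a text token

# Both toy states select the same code, which is the sense in which a
# shared codebook can tie the two modalities to one set of latent units.
print(nearest_code(speech_state, codebook), nearest_code(text_state, codebook))

rng = random.Random(0)
print(mix_with_codes([speech_state, text_state], codebook, 0.5, rng))
```

The lookup uses plain Euclidean distance for readability; a trained quantizer would learn the codebook jointly with the encoder rather than fixing it by hand.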
Pre-training Data
The model was pre-trained on approximately 960 hours of LibriSpeech audio and the LibriSpeech LM text dataset, both in English. Pre-training is self-supervised; the paper then fine-tunes the pre-trained model on downstream tasks: automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.
Inference Details
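Output is a mono 16 kHz waveform. The model itself predicts a log mel spectrogram, which a vocoder (HiFi-GAN in the Hugging Face implementation) converts to audio. The model requires a speaker embedding vector (e.g., from CMU Arctic xvectors) to control voice characteristics. No latency or throughput benchmarks are reported in the original card.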
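A minimal local-inference sketch using the Hugging Face Transformers classes for SpeechT5 might look like the following. It assumes `torch` and `transformers` are installed, that the companion vocoder checkpoint `microsoft/speecht5_hifigan` is used, and that the caller supplies a 512-value x-vector (for instance one taken from the CMU Arctic xvectors mentioned above). The helper names `check_speaker_embedding`, `synthesize`, and `write_wav` are hypothetical and exist only for this example.

```python
import struct
import wave

SAMPLE_RATE = 16_000   # output rate stated in the card
EMBEDDING_DIM = 512    # assumed x-vector size for this checkpoint


def check_speaker_embedding(values):
    """Return the embedding as a list of floats, or raise if the size is wrong."""
    vec = [float(v) for v in values]
    if len(vec) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} values, got {len(vec)}")
    return vec


def synthesize(text, speaker_embedding):
    """Generate speech samples as a list of floats; downloads weights on first call."""
    # Heavy imports stay local so the other helpers work without them.
    import torch
    from transformers import (
        SpeechT5ForTextToSpeech,
        SpeechT5HifiGan,
        SpeechT5Processor,
    )

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    inputs = processor(text=text, return_tensors="pt")
    # Shape (1, 512): one speaker vector for the single input sentence.
    embedding = torch.tensor([check_speaker_embedding(speaker_embedding)])
    with torch.no_grad():
        speech = model.generate_speech(
            inputs["input_ids"], embedding, vocoder=vocoder
        )
    return speech.tolist()


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write floats in [-1, 1] as 16-bit mono PCM, clipping out-of-range values."""
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
```

With an x-vector in hand, a call such as `write_wav("out.wav", synthesize("Hello there.", xvector))` would produce a playable file. Loading the three checkpoints inside `synthesize` keeps the sketch short; a real application would load them once and reuse them.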
Research Basis
The underlying paper is “SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing” (arXiv:2110.07205). The model code and weights are released under the MIT license.
best for
- Generating speech from text for voice assistants
- Creating audiobooks or narration
- Building custom voice applications with speaker embeddings
FAQ
It is a unified-modal encoder-decoder model fine-tuned for text-to-speech on LibriTTS, producing natural-sounding speech.
Input is text (string) and a speaker embedding vector; output is a mono 16 kHz speech waveform.
Once the hosted endpoint is available, call the OpenAI-compatible endpoint with your API key, sending a request that contains the input text and a speaker embedding.
It is released under the MIT license.
Yes, the Hugging Face model card provides a Colab notebook for fine-tuning on a different dataset or language.
We're benchmarking and onboarding SpeechT5 TTS as a hosted, OpenAI-compatible API.