SpeechT5 TTS

microsoft/speecht5_tts

published Feb 2023 · updated Nov 2023

SpeechT5 TTS is a text-to-speech model fine-tuned on LibriTTS for speech synthesis.

status: coming soon
API providers: 0
downloads / mo: 80.8K
license: MIT

specs

Task: Text-to-Speech
Architecture: Shared encoder-decoder with modal-specific pre/post-nets (SpeechT5)
License: MIT
Training Data: LibriTTS (fine-tuning)

about this model

microsoft/speecht5_tts is a text-to-speech model built on the SpeechT5 unified-modal encoder-decoder framework, fine-tuned for speech synthesis on the LibriTTS dataset. The model was developed by Microsoft, and the underlying SpeechT5 paper was published at ACL 2022.

Architecture

SpeechT5 uses a shared encoder-decoder network surrounded by six modal-specific (speech or text) pre-nets and post-nets. The text encoder pre-net maps input tokens to hidden representations (similar to BERT), while the speech decoder uses a Tacotron2-inspired pre-net (linear layers applied to log-mel spectrogram frames) and a post-net that refines the predicted spectrogram with a residual. A cross-modal vector quantization mechanism aligns speech and text representations in a unified semantic space.
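
As a rough illustration of the routing described above, the sketch below lists the six pre/post-nets and picks the three that a given input/output modality pair would pass through. All names here (PRE_POST_NETS, Route, route_for) are hypothetical and exist only for this example; they are not identifiers from the SpeechT5 codebase.

```python
from dataclasses import dataclass

# The six modal-specific nets that surround the shared encoder-decoder.
# These labels are illustrative, not names taken from the SpeechT5 code.
PRE_POST_NETS = (
    "speech_encoder_prenet",
    "text_encoder_prenet",
    "speech_decoder_prenet",
    "text_decoder_prenet",
    "speech_decoder_postnet",
    "text_decoder_postnet",
)


@dataclass(frozen=True)
class Route:
    encoder_prenet: str
    decoder_prenet: str
    decoder_postnet: str


def route_for(input_modality: str, output_modality: str) -> Route:
    """Pick the pre/post-nets used for one input/output modality pair."""
    for modality in (input_modality, output_modality):
        if modality not in ("speech", "text"):
            raise ValueError(f"unknown modality: {modality!r}")
    return Route(
        encoder_prenet=f"{input_modality}_encoder_prenet",
        decoder_prenet=f"{output_modality}_decoder_prenet",
        decoder_postnet=f"{output_modality}_decoder_postnet",
    )


# Text-to-speech: text goes in through the encoder, speech comes out of the decoder.
tts = route_for("text", "speech")
print(tts)
```

For text-to-speech, only the text encoder pre-net and the speech decoder pre-net and post-net are involved; the shared encoder-decoder in the middle is the same for every task.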

Pre-training Data

The model was pre-trained on approximately 960 hours of LibriSpeech audio data combined with the LibriSpeech LM text dataset, both English. Pre-training is self-supervised on this unlabeled speech and text; the SpeechT5 paper then fine-tunes the framework on downstream tasks including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.

Inference Details

Output is a mono 16 kHz waveform. The model requires a speaker embedding vector (e.g., from CMU Arctic xvectors) to control voice characteristics. No latency or throughput benchmarks are reported in the original card.
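
For local experimentation, the model can be run through the Hugging Face transformers library. The sketch below is a minimal example under assumptions not stated on this page: that transformers and torch are installed, and that the companion checkpoint microsoft/speecht5_hifigan is used as the vocoder that turns the predicted spectrogram into audio. The helper names synthesize, pcm16_bytes, and write_wav are hypothetical. The caller supplies the speaker embedding (for example an x-vector from CMU Arctic) as a plain list of floats.

```python
import struct
import wave

SAMPLE_RATE = 16_000  # mono 16 kHz output, as stated above


def synthesize(text, speaker_embedding):
    """Return float samples in [-1, 1] for `text` in the given voice.

    Heavy dependencies are imported lazily so the WAV helpers below
    can be used without them installed.
    """
    import torch
    from transformers import (
        SpeechT5ForTextToSpeech,
        SpeechT5HifiGan,
        SpeechT5Processor,
    )

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    # Assumption: the companion HiFi-GAN checkpoint is used as the vocoder.
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    expected = model.config.speaker_embedding_dim
    if len(speaker_embedding) != expected:
        raise ValueError(f"speaker embedding must have {expected} values")
    speaker = torch.tensor(speaker_embedding, dtype=torch.float32).unsqueeze(0)

    inputs = processor(text=text, return_tensors="pt")
    with torch.no_grad():
        speech = model.generate_speech(inputs["input_ids"], speaker, vocoder=vocoder)
    return speech.tolist()


def pcm16_bytes(samples):
    """Clamp float samples to [-1, 1] and pack them as little-endian 16-bit PCM."""
    clamped = (max(-1.0, min(1.0, s)) for s in samples)
    return b"".join(struct.pack("<h", int(s * 32767)) for s in clamped)


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write samples to `path` as a mono 16-bit WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm16_bytes(samples))
```

A typical call would be write_wav("speech.wav", synthesize("Hello, world.", xvector)), where xvector is a speaker embedding obtained separately. The first call downloads both checkpoints.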

Research Basis

The underlying paper is “SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing” (arXiv:2110.07205). The model code and weights are released under the MIT license.

FAQ

What is SpeechT5 TTS?

It is Microsoft's SpeechT5 unified-modal encoder-decoder model, fine-tuned for text-to-speech on LibriTTS.

What input and output formats does it support?

Input is text (string) and a speaker embedding vector; output is a mono 16 kHz speech waveform.

How can I call this model via the gigarouter API?

The model is not yet live on gigarouter. Once it is hosted, it will be callable through the OpenAI-compatible endpoint with your API key, using a request that carries the input text and a speaker embedding.
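
Because the hosted endpoint is not yet live, no request format has been published. The sketch below only illustrates what such a call might look like if it follows the shape of OpenAI's audio speech route. The base URL, the /audio/speech path, and the speaker_embedding field are all placeholders assumed for this example, not a documented gigarouter API. The request is built but not sent.

```python
import json
import urllib.request

# Hypothetical values: the endpoint is not live, so the base URL, the route,
# and the "speaker_embedding" field are placeholders for illustration only.
BASE_URL = "https://api.gigarouter.example/v1"


def build_speech_request(text, speaker_embedding, api_key):
    """Build (but do not send) a POST request for speech synthesis."""
    payload = {
        "model": "microsoft/speecht5_tts",
        "input": text,
        "speaker_embedding": list(speaker_embedding),
    }
    return urllib.request.Request(
        f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_speech_request("Hello, world.", [0.0] * 512, "YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

Once the model is live, the actual field names and response format should be taken from the gigarouter documentation rather than from this sketch.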

What license does this model use?

It is released under the MIT license.

Can this model be fine-tuned on new data?

Yes, the Hugging Face model card provides a Colab notebook for fine-tuning on a different dataset or language.

not yet live

We're benchmarking and onboarding SpeechT5 TTS as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.