SpeechT5 TTS

microsoft/speecht5_tts

published Feb 2023 · updated Nov 2023

SpeechT5 TTS is Microsoft's SpeechT5 model fine-tuned on LibriTTS for text-to-speech synthesis.

status

coming soon

downloads / mo

80.8K

license

mit

specs

Task	Text-to-Speech
Architecture	Shared encoder-decoder with modal-specific pre-nets and post-nets (SpeechT5)
License	MIT
Training Data	LibriTTS (fine-tuning)

about this model

microsoft/speecht5_tts is a text-to-speech model built on the SpeechT5 unified-modal encoder-decoder framework and fine-tuned for speech synthesis on the LibriTTS dataset. The model was developed by Microsoft; the underlying SpeechT5 paper was published at ACL 2022.

Architecture

SpeechT5 uses a shared encoder-decoder network with six modal-specific pre-nets and post-nets for speech and text. The text encoder pre-net maps input tokens to hidden representations (similar to BERT), while the speech decoder employs a Tacotron2-inspired pre-net (linear layers for log mel spectrograms) and a post-net (residual refinement of spectrograms). A cross-modal vector quantization mechanism aligns speech and text representations in a unified semantic space.

Pre-training Data

The model was pre-trained on approximately 960 hours of LibriSpeech audio combined with the LibriSpeech LM text dataset, both in English. Pre-training is self-supervised; according to the paper, the resulting model can then be fine-tuned for automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.

Inference Details

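Output is a mono 16 kHz waveform. The model itself predicts a log-mel spectrogram, which a vocoder (for example microsoft/speecht5_hifigan) converts into the waveform. The model requires a speaker embedding vector (e.g., an x-vector from the CMU Arctic set) to control voice characteristics. No latency or throughput benchmarks are reported in the original card.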
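As a rough illustration of the inference flow described above, the sketch below runs the model locally with the Hugging Face transformers library and saves the result as a mono 16 kHz WAV file. It assumes transformers and torch are installed, that the caller supplies the speaker embedding as a torch tensor of shape (1, 512), and that microsoft/speecht5_hifigan is used as the vocoder. The helper names synthesize and write_wav are illustrative and not part of any library.

```python
import struct
import wave

SAMPLE_RATE = 16_000  # output rate stated on this card


def synthesize(text, speaker_embedding):
    """Return a 1-D float waveform for `text`.

    Assumption: `speaker_embedding` is a torch tensor of shape (1, 512),
    e.g. one x-vector from a CMU Arctic x-vector set.
    """
    # Imported lazily so write_wav() below works without these packages.
    from transformers import (
        SpeechT5ForTextToSpeech,
        SpeechT5HifiGan,
        SpeechT5Processor,
    )

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    # The TTS model predicts a log-mel spectrogram; the vocoder turns it
    # into an audio waveform.
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    inputs = processor(text=text, return_tensors="pt")
    speech = model.generate_speech(
        inputs["input_ids"], speaker_embedding, vocoder=vocoder
    )
    return speech.detach().cpu().numpy()


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1, 1] as 16-bit mono PCM."""
    clipped = (max(-1.0, min(1.0, float(s))) for s in samples)
    frames = b"".join(struct.pack("<h", int(round(s * 32767))) for s in clipped)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
```

A typical call would be write_wav("speech.wav", synthesize("Hello, world.", embedding)), where embedding is an x-vector loaded by the caller. The first call downloads the model and vocoder weights.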

Research Basis

The underlying paper is “SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing” (arXiv:2110.07205). The model code and weights are released under the MIT license.

best for

·Generating speech from text for voice assistants
·Creating audiobooks or narration
·Building custom voice applications with speaker embeddings

FAQ

What is SpeechT5 TTS?

It is Microsoft's SpeechT5 unified-modal encoder-decoder model, fine-tuned for text-to-speech on LibriTTS.

What input and output formats does it support?

Input is text (string) and a speaker embedding vector; output is a mono 16 kHz speech waveform.

How can I call this model via the gigarouter API?

The model is not yet live. Once it is hosted, use the OpenAI-compatible endpoint with your API key and send a request containing the input text and a speaker embedding.
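Because the hosted endpoint is not published yet, the following is only a sketch of what such a request might look like. The base URL is a placeholder, the /audio/speech path and the model and input fields follow the OpenAI speech API convention, and the speaker_embedding field name is a guess rather than a documented parameter. The function only builds the request and does not send it.

```python
import json
import urllib.request

# Placeholder: the real base URL has not been published.
BASE_URL = "https://api.example.com/v1"
MODEL_ID = "microsoft/speecht5_tts"


def build_speech_request(text, speaker_embedding, api_key, base_url=BASE_URL):
    """Build (but do not send) a text-to-speech request."""
    payload = {
        "model": MODEL_ID,
        "input": text,
        # Hypothetical field name for the speaker embedding vector.
        "speaker_embedding": [float(x) for x in speaker_embedding],
    }
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Once the endpoint is live, the request could be sent with urllib.request.urlopen(request) after swapping in the documented URL and field names.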

What license does this model use?

It is released under the MIT license.

Can this model be fine-tuned on new data?

Yes. The Hugging Face model card links to a Colab notebook for fine-tuning on a different dataset or language.

not yet live

We're benchmarking and onboarding SpeechT5 TTS as a hosted, OpenAI-compatible API.

related text-to-speech models

XTTS-v2

9.3M dl/mo

Qwen3-TTS-12Hz-1.7B-CustomVoice

2M dl/mo

Qwen3-TTS-12Hz-0.6B-CustomVoice