VibeVoice 1.5B
microsoft/VibeVoice-1.5B
published Aug 2025 · updated Jan 2026
VibeVoice 1.5B is a text-to-speech model that generates expressive, long-form, multi-speaker conversational audio such as podcasts.
specs
| Spec | Value |
|---|---|
| Task | Text-to-Speech |
| Architecture | Transformer-based LLM (Qwen2.5-1.5B) with continuous speech tokenizers and diffusion head |
| Parameters | 1.5B (LLM); tokenizers ~340M each; diffusion head ~123M |
| Context Length | 64K tokens (~90 min audio) |
| License | MIT License (research purposes only) |
about this model
Key Capabilities
- Generates long-form, multi-speaker conversational audio (e.g., podcasts) from text input.
- Supports up to 4 distinct speakers in a single generation.
- Context length of 64,000 tokens, enabling synthesis of up to 90 minutes of continuous speech.
- Built on a Qwen2.5-1.5B LLM with a diffusion-based decoding head (4 layers, ~123M parameters).
- Continuous speech tokenizers (acoustic and semantic) operate at 7.5 Hz, achieving 80× better data compression than Encodec while maintaining comparable audio fidelity.
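The 7.5 Hz frame rate and the 90-minute figure can be cross-checked with simple arithmetic. The sketch below uses the 24 kHz sample rate and 3200× downsampling quoted in the Architecture section, and assumes (this page does not state it) that each speech frame occupies one position in the context window. The function names are illustrative only.

```python
SAMPLE_RATE_HZ = 24_000    # input sample rate quoted in the Architecture section
DOWNSAMPLE_FACTOR = 3_200  # acoustic tokenizer downsampling quoted there
CONTEXT_TOKENS = 64_000    # context length from the list above


def frame_rate_hz(sample_rate_hz=SAMPLE_RATE_HZ, factor=DOWNSAMPLE_FACTOR):
    """Frames per second produced by the tokenizer."""
    return sample_rate_hz / factor


def speech_frames(minutes, rate_hz=None):
    """Number of speech frames needed for `minutes` of audio."""
    rate = frame_rate_hz() if rate_hz is None else rate_hz
    return int(minutes * 60 * rate)


rate = frame_rate_hz()                  # 24000 / 3200 = 7.5
frames_90 = speech_frames(90)           # 90 * 60 * 7.5 = 40500
remaining = CONTEXT_TOKENS - frames_90  # 23500 positions left over

print(rate, frames_90, remaining)
```

Under that one-frame-per-position assumption, 90 minutes of speech uses roughly 40,500 positions, which leaves room in the 64,000-token window for the script text and any other conditioning.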
Model Variants
| Model | Context Length | Generation Length |
|---|---|---|
| VibeVoice-0.5B-Streaming | — | — |
| VibeVoice-1.5B | 64K tokens | ~90 minutes |
| VibeVoice-Large | 32K tokens | ~45 minutes |
Architecture
The model uses a transformer-based LLM (Qwen2.5-1.5B) integrated with acoustic and semantic tokenizers and a diffusion-based decoding head. The acoustic tokenizer, based on a σ-VAE variant, downsamples 24 kHz input by 3200×, which gives the 7.5 Hz frame rate. The diffusion head (4 layers, ~123M parameters) predicts acoustic VAE features with a Denoising Diffusion Probabilistic Model (DDPM) process; at inference it uses Classifier-Free Guidance (CFG) and DPM-Solver.
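As a rough illustration of the Classifier-Free Guidance step mentioned above, the sketch below shows the standard CFG combination of a conditional and an unconditional prediction. It is the generic formula, not VibeVoice's actual code: the guidance scale used by the model and the exact quantity its diffusion head predicts are not given on this page, and `cfg_combine` is a hypothetical name.

```python
def cfg_combine(cond, uncond, scale):
    """Generic classifier-free guidance, applied elementwise:

        guided = uncond + scale * (cond - uncond)

    scale = 0 gives the unconditional prediction, scale = 1 gives the
    conditional one, and scale > 1 pushes further toward the condition.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy two-dimensional predictions, chosen only for illustration.
cond_pred = [1.0, 2.0]
uncond_pred = [0.0, 1.0]

guided = cfg_combine(cond_pred, uncond_pred, scale=1.5)
print(guided)
```

In a real sampler this combination would run once per denoising step, with the guided prediction passed to the solver update (DPM-Solver, per the description above).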
Benchmarks and Research Context
VibeVoice was accepted as an Oral at ICLR 2026. The underlying LatentLM framework, which VibeVoice builds upon, outperforms the state-of-the-art VALL-E 2 model in speaker similarity and robustness while requiring 10× fewer decoding steps. The TTS code repository was disabled on 2025-09-05 due to observed out-of-scope use; the model is intended for research purposes only.
Additional Information
The model is trained on English and Chinese data only. It does not handle overlapping speech, background noise, music, or other non-speech audio. Generated audio includes an audible AI disclaimer and an imperceptible watermark for provenance verification. For further details, refer to the technical report and project page.
best for
- Generating podcast-style conversations with multiple speakers
- Creating long-form audiobooks or dialogue-heavy content
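Before submitting a multi-speaker script, it can help to check it against the 4-speaker limit. The sketch below is a minimal pre-flight check. It assumes a `Speaker N: text` line format, which this page does not specify, so adapt the pattern to whatever format your inference setup expects. `parse_script` and `MAX_SPEAKERS` are illustrative names.

```python
import re

MAX_SPEAKERS = 4  # limit stated on this page

# Assumed line format: "Speaker <number>: <text>" (not confirmed by this page).
_TURN = re.compile(r"^\s*Speaker\s+(\d+)\s*:\s*(.+)$")


def parse_script(text):
    """Split a script into (speaker_id, utterance) turns and enforce the limit."""
    turns = []
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines between turns
        match = _TURN.match(raw)
        if match is None:
            raise ValueError(f"line has no speaker label: {raw!r}")
        turns.append((int(match.group(1)), match.group(2).strip()))
    speakers = {speaker for speaker, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(
            f"{len(speakers)} speakers found; at most {MAX_SPEAKERS} are supported"
        )
    return turns


script = """
Speaker 1: Welcome back to the show.
Speaker 2: Thanks, it's good to be here.
Speaker 1: Let's get started.
"""
turns = parse_script(script)
print(len(turns), sorted({s for s, _ in turns}))
```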
FAQ
Maximum length: up to 90 minutes of continuous speech within the 64K-token context window.
Speakers: up to 4 distinct speakers in a single generation.
Languages: English and Chinese only; other languages may produce unexpected or unintelligible output.
API access: use the OpenAI-compatible endpoint with your gigarouter API key; refer to the gigarouter documentation for the exact endpoint and request format.
Commercial use: not recommended. The model is released under the MIT License but is explicitly limited to research purposes.
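For orientation only, the sketch below builds (but does not send) a request shaped like OpenAI's `POST /audio/speech` call. Everything specific to the hosted service is an assumption: the base URL is a placeholder, the model identifier is borrowed from the repository name at the top of this page, and the accepted fields may differ (OpenAI's own endpoint also takes a `voice` field, omitted here). The gigarouter documentation is the authority on the real endpoint and request format.

```python
import json
import urllib.request

# Placeholder base URL; NOT the real gigarouter endpoint.
BASE_URL = "https://api.example.invalid/v1"


def build_speech_request(api_key, text,
                         model="microsoft/VibeVoice-1.5B",  # assumed model id
                         base_url=BASE_URL):
    """Build an OpenAI-style text-to-speech request without sending it."""
    payload = {"model": model, "input": text}
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_speech_request("YOUR_API_KEY", "Speaker 1: Hello and welcome.")
print(req.get_method(), req.full_url)
```

Once the real base URL and fields are confirmed, the request could be sent with `urllib.request.urlopen(req)` and the response bytes written to an audio file.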
We're benchmarking and onboarding VibeVoice 1.5B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.