VibeVoice 1.5B

microsoft/VibeVoice-1.5B

published Aug 2025 · updated Jan 2026

VibeVoice 1.5B is a text-to-speech model that generates expressive, long-form, multi-speaker conversational audio such as podcasts.

est. price

~$0.0075

· estimated, set at launch

API providers

downloads / mo

235.5K

license

mit

specs

Task	Text-to-Speech
Architecture	Transformer-based LLM (Qwen2.5-1.5B) with continuous speech tokenizers and diffusion head
Parameters	1.5B (LLM); tokenizers ~340M each; diffusion head ~123M
Context Length	64K tokens (~90 min audio)
License	MIT License (research purposes only)

about this model

VibeVoice-1.5B is a text-to-speech (TTS) model that generates expressive, long-form, multi-speaker conversational audio, such as podcasts or dialogue, from text. It is being onboarded to gigarouter as a managed, OpenAI-compatible API. The model uses a next-token diffusion framework built on the Qwen2.5-1.5B large language model (LLM), combined with continuous acoustic and semantic speech tokenizers that operate at an ultra-low frame rate of 7.5 Hz. This architecture enables synthesis of up to 90 minutes of speech with up to 4 distinct speakers, capturing natural turn-taking and conversational flow. The tokenizer achieves an 80× improvement in data compression over Encodec while maintaining comparable audio fidelity.

Key Capabilities

Generates long-form, multi-speaker conversational audio (e.g., podcasts) from text input.
Supports up to 4 distinct speakers in a single generation.
Context length of 64,000 tokens, enabling synthesis of up to 90 minutes of continuous speech.
Built on a Qwen2.5-1.5B LLM with a diffusion-based decoding head (4 layers, ~123M parameters).
Continuous speech tokenizers (acoustic and semantic) operate at 7.5 Hz, achieving 80× better data compression than Encodec while maintaining comparable audio fidelity.
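The speaker limit listed above can be checked before a script is submitted. The following is a minimal sketch, not part of the model or any official tooling: the `Speaker N: text` line format and the helper name `build_script` are assumptions made for illustration, and only the limit of 4 speakers comes from the description above.

```python
MAX_SPEAKERS = 4  # limit stated in the model description


def build_script(turns):
    """Format (speaker_name, text) turns as 'Speaker N: text' lines.

    The line format is an assumed convention, not a documented input spec.
    Raises ValueError if more than MAX_SPEAKERS distinct speakers appear.
    """
    speaker_ids = {}  # speaker name -> 1-based index, in order of first appearance
    lines = []
    for name, text in turns:
        if name not in speaker_ids:
            if len(speaker_ids) == MAX_SPEAKERS:
                raise ValueError(
                    f"script uses more than {MAX_SPEAKERS} distinct speakers"
                )
            speaker_ids[name] = len(speaker_ids) + 1
        lines.append(f"Speaker {speaker_ids[name]}: {text.strip()}")
    return "\n".join(lines)
```

Calling `build_script([("Ana", "Welcome back."), ("Ben", "Glad to be here.")])` returns two lines, one per turn, and a fifth distinct speaker name raises `ValueError` before anything is sent.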

Model Variants

Model	Context Length	Generation Length
VibeVoice-0.5B-Streaming	—	—
VibeVoice-1.5B	64K tokens	~90 minutes
VibeVoice-Large	32K tokens	~45 minutes

Architecture

The model integrates a transformer-based LLM (Qwen2.5-1.5B) with acoustic and semantic tokenizers and a diffusion-based decoding head. The acoustic tokenizer, based on a σ-VAE variant, achieves 3200× downsampling from 24 kHz input, which yields the 7.5 Hz frame rate. The diffusion head (4 layers, ~123M parameters) predicts acoustic VAE features through a Denoising Diffusion Probabilistic Model (DDPM) process. During inference it uses Classifier-Free Guidance (CFG) and DPM-Solver.
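The figures quoted above are consistent with one another, which a short calculation makes visible. This sketch uses only the 24 kHz sample rate, the 3200× downsampling factor, and the 64,000-token context from this page. The final line assumes one context position per speech frame, which this page does not state, so treat the remaining budget as a rough illustration only.

```python
SAMPLE_RATE_HZ = 24_000   # input sample rate stated above
DOWNSAMPLING = 3_200      # acoustic tokenizer downsampling factor stated above
CONTEXT_TOKENS = 64_000   # context length stated on this page

# 24,000 samples/s divided by 3,200 samples per frame = 7.5 frames/s
frame_rate_hz = SAMPLE_RATE_HZ / DOWNSAMPLING


def speech_frames(minutes):
    """Number of tokenizer frames needed for `minutes` of audio."""
    return int(minutes * 60 * frame_rate_hz)


frames_90_min = speech_frames(90)  # 90 * 60 * 7.5 = 40,500 frames

# ASSUMPTION: one context position per speech frame (not stated in the source).
remaining_for_text = CONTEXT_TOKENS - frames_90_min  # 23,500 under that assumption
```

Under that assumption, 90 minutes of audio would occupy roughly two thirds of the context window, leaving the rest for the input script and speaker prompts.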

Benchmarks and Research Context

VibeVoice was accepted as an Oral at ICLR 2026. LatentLM, the framework VibeVoice builds upon, outperforms the state-of-the-art VALL-E 2 model in speaker similarity and robustness while requiring 10× fewer decoding steps. The TTS code repository was disabled on 2025-09-05 due to observed out-of-scope use; the model is intended for research purposes only.

Additional Information

The model is trained on English and Chinese data only. It does not handle overlapping speech, background noise, music, or other non-speech audio. Generated audio includes an audible AI disclaimer and an imperceptible watermark for provenance verification. For further details, refer to the technical report and project page.

best for

· Generating podcast-style conversations with multiple speakers
· Creating long-form audiobooks or dialogue-heavy content

FAQ

What is the maximum audio length VibeVoice 1.5B can produce?

Up to 90 minutes of continuous speech, within a 64K-token context window.

How many distinct speakers does it support?

Up to 4 distinct speakers in a single generation.

What languages are supported?

English and Chinese only; other languages may produce unexpected or unintelligible outputs.

How can I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your gigarouter API key; refer to the gigarouter documentation for exact endpoint and request format.
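As a rough illustration of what an OpenAI-compatible call might look like, the sketch below builds a request without sending it. Everything specific here is an assumption: the base URL is a placeholder, the `/audio/speech` path and the `model`/`input` fields follow OpenAI's speech API convention, and the model identifier is taken from the listing name on this page. The actual endpoint, identifier, and required fields (for example a voice selection) may differ, so check the gigarouter documentation.

```python
import json
import urllib.request

# Hypothetical values: replace with those from the gigarouter documentation.
BASE_URL = "https://api.gigarouter.example/v1"  # placeholder, not a real endpoint
API_KEY = "YOUR_GIGAROUTER_API_KEY"


def build_speech_request(text, model="microsoft/VibeVoice-1.5B"):
    """Build (but do not send) a POST request in the OpenAI speech-API shape."""
    payload = {"model": model, "input": text}  # field names assumed from OpenAI's API
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",  # path assumed from OpenAI's convention
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_speech_request("Speaker 1: Hello and welcome to the show.")
```

Sending is deliberately left out. Passing `request` to `urllib.request.urlopen` would perform the call, and the response body would then need to be written to an audio file.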

Is the model licensed for commercial use?

Not recommended. The model is released under the MIT License, but it is explicitly intended for research purposes only, and commercial or real-world use is discouraged.

not yet live

We're benchmarking and onboarding VibeVoice 1.5B as a hosted, OpenAI-compatible API. It is not yet available for requests.

related text-to-speech models

XTTS-v2

9.3M dl/mo

Qwen3-TTS-12Hz-1.7B-CustomVoice

2M dl/mo

Qwen3-TTS-12Hz-0.6B-CustomVoice