VoxCPM2
openbmb/VoxCPM2
published Apr 2026 · updated Apr 2026
VoxCPM2 is a tokenizer-free, diffusion autoregressive text-to-speech model with 2B parameters, supporting 30 languages and 48 kHz studio-quality audio output.
specs
| Task | Text-to-Speech |
| Architecture | Tokenizer-free Diffusion Autoregressive (LocEnc → TSLM → RALM → LocDiT) |
| Parameters | 2B |
| License | Apache-2.0 |
| Supported Languages | 30 languages |
| Output Sample Rate | 48 kHz |
about this model
VoxCPM2 is a tokenizer-free diffusion autoregressive text-to-speech model with 2 billion parameters, supporting 30 languages and generating 48 kHz studio-quality audio. Trained on over 2 million hours of multilingual speech data, it accepts 16 kHz reference audio and outputs 48 kHz via its built-in AudioVAE V2 super-resolution.
Key Capabilities
- Voice Design – Create a novel voice from a natural-language description alone (e.g., gender, age, emotion, pace) without any reference audio.
- Controllable Voice Cloning – Clone any voice from a short audio clip, with optional style guidance to steer emotion, pace, and expression while preserving timbre.
- Ultimate Cloning – Provide both reference audio and its exact transcript for maximum fidelity, reproducing every vocal nuance.
- Context-Aware Synthesis – Automatically infers appropriate prosody and expressiveness from the input text.
- Real-Time Streaming – Achieves a real-time factor (RTF) of approximately 0.30 on an NVIDIA RTX 4090, and approximately 0.13 when accelerated with Nano-VLLM or vLLM-Omni.
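To make the streaming figures concrete: the real-time factor is the time spent generating audio divided by the duration of that audio, so values below 1.0 mean faster than real time. The sketch below only does that arithmetic with the two RTF values quoted above; the function names are illustrative and not part of any VoxCPM2 API.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock generation time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock time needed to generate `audio_seconds` of speech."""
    return audio_seconds * rtf


# RTF values quoted in the listing: ~0.30 on an RTX 4090,
# ~0.13 with Nano-VLLM or vLLM-Omni acceleration.
for label, rtf in [("baseline", 0.30), ("accelerated", 0.13)]:
    seconds = synthesis_time(10.0, rtf)
    print(f"{label}: 10 s of audio in about {seconds:.1f} s "
          f"({1 / rtf:.1f}x faster than real time)")
```

At these figures, a 10-second utterance would take roughly 3 s to generate at the baseline RTF and roughly 1.3 s with acceleration, assuming the RTF holds for that clip length and hardware.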
Benchmark Results
On its internal 30-language evaluation set, VoxCPM2 attains an average word error rate (WER) of 1.68%. The model also achieves state-of-the-art or competitive results on major zero-shot and controllable TTS benchmarks, as reported in the GitHub repository.
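For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recognized transcript, divided by the number of reference words. The following is a generic sketch of that definition, not the evaluation pipeline behind the 1.68% figure; the speech recognizer and text normalization used there are not described in this listing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("the quick brown fox jumps",
                      "the quick brown dog jumps"))
```

In a TTS evaluation, the hypothesis would typically come from transcribing the synthesized audio with a speech recognizer, so the score reflects both the synthesis and the recognizer.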
Licensing and Availability
VoxCPM2 is released under the Apache-2.0 license, free for commercial use. The model weights are available on Hugging Face and ModelScope. Technical details are provided in the VoxCPM2 Technical Report (arXiv:2606.06928).
best for
- Multilingual text-to-speech across 30 languages
- Voice design from natural-language description (e.g., gender, tone, emotion)
- Controllable voice cloning with optional style guidance
- Real-time streaming TTS with low latency
FAQ
VoxCPM2 excels at multilingual speech synthesis, creative voice design without reference audio, and controllable voice cloning with style control, all in real time.
It supports 30 languages, including Arabic, Chinese, English, French, German, Japanese, Korean, Spanish, and many others, with no language tag needed.
VoxCPM2 outputs 48 kHz studio-quality audio via its AudioVAE V2 with built-in super-resolution and accepts 16 kHz reference input.
VoxCPM2 is released under the Apache-2.0 license and is free for commercial use.
Send requests to the gigarouter OpenAI-compatible endpoint with your API key; refer to the gigarouter documentation for the exact endpoint and request format.
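As a rough illustration of what an OpenAI-compatible speech request could look like, the sketch below builds (but does not send) an HTTP request modeled on the OpenAI `/audio/speech` route. Everything specific here is an assumption: the base URL is a placeholder, the model ID is guessed from the listing name, and the payload fields may differ from what the hosted endpoint accepts, so the gigarouter documentation remains the authority.

```python
import json
import urllib.request

# Hypothetical placeholder; replace with the base URL from the provider docs.
BASE_URL = "https://api.example.com/v1"
# Assumed to match the listing name; the hosted model ID may differ.
MODEL_ID = "openbmb/VoxCPM2"


def build_speech_request(api_key: str, text: str) -> urllib.request.Request:
    """Build a POST request shaped like OpenAI's text-to-speech route."""
    # OpenAI's own route also expects a "voice" field; whether and how this
    # endpoint uses it is unknown, so it is left out of the sketch.
    payload = {"model": MODEL_ID, "input": text}
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_speech_request("YOUR_API_KEY", "Hello from VoxCPM2.")
print(request.get_method(), request.full_url)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen` and writing the response bytes to a file, once the real endpoint, model ID, and response format are confirmed.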
We're benchmarking and onboarding VoxCPM2 as a hosted, OpenAI-compatible API.