VoxCPM2

openbmb/VoxCPM2

published Apr 2026 · updated Apr 2026

VoxCPM2 is a tokenizer-free, diffusion autoregressive text-to-speech model with 2B parameters, supporting 30 languages and 48 kHz studio-quality audio output.

est. price

~$0.0075

· estimated, set at launch

API providers

downloads / mo

640.8K

license

apache-2.0

specs

Task	Text-to-Speech
Architecture	Tokenizer-free Diffusion Autoregressive (LocEnc → TSLM → RALM → LocDiT)
Parameters	2B
License	Apache-2.0
Supported Languages	30 languages
Output Sample Rate	48 kHz

about this model

VoxCPM2 is a tokenizer-free diffusion autoregressive text-to-speech model with 2 billion parameters, supporting 30 languages and generating 48 kHz studio-quality audio. It was trained on over 2 million hours of multilingual speech data. It accepts 16 kHz reference audio and produces 48 kHz output through the super-resolution built into its AudioVAE V2.
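Since reference audio is expected at 16 kHz, it can help to check a clip's sample rate before using it for cloning. The following is a minimal pre-flight sketch using only Python's standard `wave` module. It assumes the reference clip is an uncompressed WAV file; the function names are illustrative and are not part of VoxCPM2, and this page does not say whether the model resamples other rates on its own.

```python
import wave

EXPECTED_REFERENCE_RATE_HZ = 16_000  # reference-input rate stated in the model description


def reference_clip_info(path):
    """Return (sample_rate_hz, duration_seconds) for a WAV file."""
    with wave.open(path, "rb") as clip:
        rate = clip.getframerate()
        frames = clip.getnframes()
    return rate, frames / rate


def is_expected_rate(path):
    """True when the clip is already at the 16 kHz reference rate."""
    rate, _ = reference_clip_info(path)
    return rate == EXPECTED_REFERENCE_RATE_HZ
```

A clip that fails the check could be resampled with an audio tool of your choice before use.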

Key Capabilities

Voice Design – Create a novel voice from a natural-language description alone (e.g., gender, age, emotion, pace) without any reference audio.
Controllable Voice Cloning – Clone any voice from a short audio clip, with optional style guidance to steer emotion, pace, and expression while preserving timbre.
Ultimate Cloning – Provide both reference audio and its exact transcript for maximum fidelity, reproducing every vocal nuance.
Context-Aware Synthesis – Automatically infers appropriate prosody and expressiveness from the input text.
Real-Time Streaming – Achieves a real-time factor (RTF) of approximately 0.30 on an NVIDIA RTX 4090, and approximately 0.13 when accelerated with Nano-VLLM or vLLM-Omni.
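To make the RTF figures concrete: real-time factor is processing time divided by the duration of the audio produced, so values below 1.0 mean synthesis runs faster than playback. The sketch below only does that arithmetic with the two figures quoted above. The helper names are illustrative, and actual timings will vary with hardware, text length, and settings.

```python
def synthesis_seconds(audio_seconds, rtf):
    """Estimated compute time, from RTF = processing time / audio duration."""
    return audio_seconds * rtf


def speedup_over_real_time(rtf):
    """How many seconds of audio are produced per second of compute."""
    return 1.0 / rtf


if __name__ == "__main__":
    # RTF values quoted in the capability list above.
    for label, rtf in [("RTX 4090", 0.30), ("Nano-VLLM / vLLM-Omni", 0.13)]:
        seconds = synthesis_seconds(10.0, rtf)
        print(f"{label}: ~{seconds:.1f} s of compute for 10 s of audio, "
              f"~{speedup_over_real_time(rtf):.1f}x real time")
```

At an RTF of about 0.30, ten seconds of speech takes roughly three seconds to generate; at about 0.13, roughly 1.3 seconds.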

Benchmark Results

On its internal 30-language evaluation set, VoxCPM2 attains an average Word Error Rate (WER) of 1.68%. According to the GitHub repository, the model also achieves state-of-the-art or competitive results on major zero-shot and controllable TTS benchmarks.
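For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. For TTS it is typically computed by transcribing the synthesized audio with a speech recognizer and comparing that transcript with the input text. The sketch below shows only the standard word-level edit-distance calculation. It is not the VoxCPM2 evaluation pipeline, whose recognizer and text normalization are not described on this page.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Real evaluations usually lowercase the text and strip punctuation before scoring, and character error rate is often used instead for languages without whitespace word boundaries.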

Licensing and Availability

VoxCPM2 is released under the Apache-2.0 license, free for commercial use. The model weights are available on Hugging Face and ModelScope. Technical details are provided in the VoxCPM2 Technical Report (arXiv:2606.06928).

best for

·Multilingual text-to-speech across 30 languages
·Voice design from natural-language description (e.g., gender, tone, emotion)
·Controllable voice cloning with optional style guidance
·Real-time streaming TTS with low latency

FAQ

What is VoxCPM2 best used for?

It excels at multilingual speech synthesis, creative voice design without reference audio, and controllable voice cloning with style control, all in real time.

How many languages does VoxCPM2 support?

It supports 30 languages, including Arabic, Chinese, English, French, German, Japanese, Korean, Spanish, and many others, with no language tag needed.

What is the output audio quality?

VoxCPM2 outputs 48 kHz studio-quality audio via its AudioVAE V2 with built-in super-resolution, accepting 16 kHz reference input.

What is the license of VoxCPM2?

VoxCPM2 is released under the Apache-2.0 license and is free for commercial use.

How do I use VoxCPM2 via the gigarouter API?

VoxCPM2 is not yet live on gigarouter. Once it is, send requests to the gigarouter OpenAI-compatible endpoint with your API key; refer to the gigarouter documentation for the exact endpoint and request format.
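As a rough illustration of what an OpenAI-style speech request looks like, the sketch below builds a POST to an `/audio/speech` path, the path OpenAI's own speech API uses, without sending it. Everything specific here is an assumption: the base URL is a placeholder, using `openbmb/VoxCPM2` as the model identifier is a guess, and the payload fields gigarouter will accept (for example a `voice` field, which OpenAI's endpoint also takes) are not documented on this page.

```python
import json
import urllib.request

# Hypothetical values: replace both with those from the gigarouter documentation.
BASE_URL = "https://api.example.invalid/v1"
MODEL_ID = "openbmb/VoxCPM2"


def build_speech_request(text, api_key, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) an OpenAI-style POST to the speech endpoint."""
    payload = {"model": model, "input": text}  # assumed field names
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Once a real endpoint exists, the request could be sent with `urllib.request.urlopen(request)` and the response body written to an audio file.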

not yet live

We're benchmarking and onboarding VoxCPM2 as a hosted, OpenAI-compatible API.

related text-to-speech models

XTTS-v2

9.3M dl/mo

Qwen3-TTS-12Hz-1.7B-CustomVoice

2M dl/mo

Qwen3-TTS-12Hz-0.6B-CustomVoice