Higgs Audio v3 TTS 4B
multimodalart/higgs-audio-v3-tts-4b-transformers
published Jun 2026 · updated Jun 2026
Higgs Audio v3 TTS 4B is a zero-shot text-to-speech model with voice cloning, built on a Qwen3-4B backbone and a multi-codebook audio head.
specs
| Task | Text-to-Speech (TTS) with Voice Cloning |
| Architecture | Qwen3-4B backbone with fused multi-codebook audio embedding/head |
| Parameters | 4 billion |
| License | Boson Higgs Audio v3 Research and Non-Commercial License |
about this model
Higgs Audio v3 TTS (4B) synthesizes speech from text using a Qwen3-4B backbone combined with a multi-codebook audio embedding and head. This release is a transformers-compatible port of the original Boson AI checkpoint.
Capabilities
The model supports zero-shot TTS and voice cloning from a reference audio clip, with an optional transcript of that clip. It outputs mono 24 kHz waveforms. Generation uses a delay pattern across 8 codebooks (vocabulary size 1026, including begin-of-code and end-of-code special tokens); de-delay and decoding are handled internally. The audio tokenizer runs at 25 frames per second, half the frame rate of many baselines, and is trained on 24 kHz data covering speech, music, and sound events in a single unified system. Because its encoder/decoder is not diffusion-based, it needs no iterative denoising, which allows fast batched inference.
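To make the delay pattern concrete, here is a minimal sketch in plain Python. It assumes the common layout in which codebook k is shifted right by k steps; the exact offsets used by this model are not stated here, and the `PAD` filler id and the function names are hypothetical. In the real model, de-delay happens internally, so this is for illustration only.

```python
from typing import List

NUM_CODEBOOKS = 8      # from the model description
SAMPLE_RATE = 24_000   # Hz, from the model description
FRAME_RATE = 25        # tokenizer frames per second, from the model description
SAMPLES_PER_FRAME = SAMPLE_RATE // FRAME_RATE  # 960 audio samples per frame

PAD = -1  # hypothetical filler id, used only in this sketch


def apply_delay(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift codebook k right by k steps (assumed one-step-per-codebook delay).

    `codes` is laid out as [num_codebooks][num_frames], all rows equal length.
    The result has num_frames + num_codebooks - 1 columns.
    """
    n_q = len(codes)
    return [[pad] * k + list(row) + [pad] * (n_q - 1 - k)
            for k, row in enumerate(codes)]


def revert_delay(delayed: List[List[int]]) -> List[List[int]]:
    """Undo apply_delay: realign every codebook to the same frame index."""
    n_q = len(delayed)
    n_frames = len(delayed[0]) - (n_q - 1)
    return [row[k:k + n_frames] for k, row in enumerate(delayed)]


# Toy input: 8 codebooks, 5 frames (0.2 s of audio at 25 frames per second).
codes = [[k * 10 + t for t in range(5)] for k in range(NUM_CODEBOOKS)]
delayed = apply_delay(codes)
restored = revert_delay(delayed)
```

With these figures, each frame spans 960 samples, and one code per codebook per frame works out to 200 codes per second of audio before any delay padding.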
Supported Languages
The model is described as supporting approximately 100 languages; the upstream metadata lists 87 language tags.
Licensing
This model is released under the Boson Higgs Audio v3 Research and Non-Commercial License. Production, hosted, or revenue-generating use requires a separate commercial license from Boson AI.
Additional Sources
The tokenizer was evaluated across DAPS (speech), MUSDB (music), and AudioSet (sound events) with 1,000 clips per category (10 seconds each), plus an Audiophile subset of 150 clips (30 seconds each) from 11 high-fidelity test discs. Metrics included acoustic reconstruction error, semantic integrity (SeedTTS subsets), and Meta Audiobox Aesthetics scores.
The model is also available via the Boson AI API (api.boson.ai/v1/audio/speech) with an OpenAI-compatible interface, preset voices, and streaming support.
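As a rough sketch of what an OpenAI-compatible call to that endpoint might look like, the snippet below builds (but does not send) a request using only the Python standard library. The `model`, `input`, and `voice` field names follow OpenAI's speech API; whether the Boson AI API accepts exactly these fields is an assumption, and the model id, voice name, and `https` scheme shown here are hypothetical placeholders.

```python
import json
import urllib.request

# Host and path come from the text above; the https scheme is assumed.
API_URL = "https://api.boson.ai/v1/audio/speech"


def build_speech_request(text: str, api_key: str,
                         model: str = "higgs-audio-v3-tts-4b",  # hypothetical id
                         voice: str = "default"                  # hypothetical preset
                         ) -> urllib.request.Request:
    """Build a POST request shaped like OpenAI's speech API (assumed schema)."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_speech_request("Hello from Higgs Audio.", api_key="sk-test")
```

Passing `req` to `urllib.request.urlopen` would send it; check the provider's documentation for the actual model ids, preset voice names, and streaming options before relying on this shape.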
best for
- Zero-shot TTS from text prompts
- Voice cloning using a reference audio clip
- Multilingual speech generation (~100 languages)
FAQ
It returns a mono 24 kHz waveform as a CPU float32 tensor.
Yes, you can clone a voice by providing a reference audio clip and optional transcript.
It supports approximately 100 languages, based on 87 language tags in the upstream model.
The model uses a research/non-commercial license; production or revenue-generating use requires a separate commercial license from Boson AI.
Use the gigarouter OpenAI-compatible endpoint with an API key.
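Since the output is described as a mono 24 kHz float32 waveform, a small sketch of saving such a waveform to disk may help. It uses only the Python standard library and assumes the samples have already been converted to a flat sequence of floats in [-1, 1] (for a PyTorch tensor, something like `.flatten().tolist()`); the `write_mono_wav` helper is hypothetical and not part of the model's API.

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000  # Hz, the output rate stated for this model


def write_mono_wav(path: str, samples, sample_rate: int = SAMPLE_RATE) -> None:
    """Write float samples in [-1, 1] as a 16-bit PCM mono WAV file."""
    # Clip to the valid range so out-of-range values do not overflow int16.
    clipped = (max(-1.0, min(1.0, float(s))) for s in samples)
    pcm = b"".join(struct.pack("<h", int(round(s * 32767))) for s in clipped)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)        # mono
        wf.setsampwidth(2)        # 2 bytes = 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)


# Stand-in for model output: one second of a 440 Hz tone at half amplitude.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)]
```

Calling `write_mono_wav("out.wav", tone)` produces a one-second file. Converting to 16-bit PCM is a choice made here for broad player compatibility; the float32 data could also be stored as-is with a library that supports float WAV.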