Higgs Audio v3 TTS 4B

multimodalart/higgs-audio-v3-tts-4b-transformers

published Jun 2026 · updated Jun 2026

Higgs Audio v3 TTS 4B is a zero-shot text-to-speech model with voice cloning, built on a Qwen3-4B backbone and a multi-codebook audio head.

est. price

~$0.0075

· estimated, set at launch

API providers

downloads / mo

62.9K

license

other

specs

Task	Text-to-Speech (TTS) with Voice Cloning
Architecture	Qwen3-4B backbone with fused multi-codebook audio embedding/head
Parameters	4 billion
License	Boson Higgs Audio v3 Research and Non-Commercial License

about this model

Higgs Audio v3 TTS (4B) is a text-to-speech model that synthesizes speech from text using a Qwen3-4B backbone combined with a multi-codebook audio embedding and head, packaged as a transformers-compatible port of the original Boson AI checkpoint.

Capabilities

The model supports zero-shot TTS and voice cloning from a reference audio clip, optionally accompanied by its transcript. It outputs mono 24 kHz waveforms. Generation uses a delay pattern across 8 codebooks (vocabulary size 1026, including begin-of-code and end-of-code special tokens); de-delaying and decoding are handled internally. The audio tokenizer runs at 25 frames per second, half the frame rate of many baselines, and is a single unified system trained on 24 kHz data covering speech, music, and sound events. Its non-diffusion encoder/decoder enables fast batched inference without iterative denoising.
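To make the delay pattern and the frame-rate figures concrete, here is a minimal NumPy sketch. It is not the model's internal code: it assumes the common scheme in which codebook k is shifted right by k steps, and it assumes the two special tokens take IDs 1024 and 1025 (the listing only gives the vocabulary size of 1026). The function names are illustrative.

```python
import numpy as np

NUM_CODEBOOKS = 8        # from the listing
FRAME_RATE_HZ = 25       # tokenizer frames per second, from the listing
SAMPLE_RATE_HZ = 24_000  # output sample rate, from the listing
CODEBOOK_SIZE = 1024     # assumption: 1026 vocabulary entries minus 2 special tokens
BOC_ID = 1024            # assumption: begin-of-code token ID
EOC_ID = 1025            # assumption: end-of-code token ID


def apply_delay(codes: np.ndarray) -> np.ndarray:
    """Shift codebook k right by k steps; pad with BOC before and EOC after."""
    num_codebooks, num_frames = codes.shape
    delayed = np.full(
        (num_codebooks, num_frames + num_codebooks - 1), EOC_ID, dtype=codes.dtype
    )
    for k in range(num_codebooks):
        delayed[k, :k] = BOC_ID
        delayed[k, k:k + num_frames] = codes[k]
    return delayed


def undo_delay(delayed: np.ndarray) -> np.ndarray:
    """Invert apply_delay, recovering the time-aligned (codebooks, frames) grid."""
    num_codebooks, total_steps = delayed.shape
    num_frames = total_steps - (num_codebooks - 1)
    return np.stack([delayed[k, k:k + num_frames] for k in range(num_codebooks)])


# Two seconds of audio correspond to 2 * 25 = 50 frames per codebook.
rng = np.random.default_rng(0)
codes = rng.integers(0, CODEBOOK_SIZE, size=(NUM_CODEBOOKS, 50), dtype=np.int64)
delayed = apply_delay(codes)
restored = undo_delay(delayed)

samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_HZ  # waveform samples covered by one frame
tokens_per_second = FRAME_RATE_HZ * NUM_CODEBOOKS    # audio tokens generated per second
```

Under these assumptions, each frame covers 960 output samples and one second of audio costs 200 audio tokens, which is where the lower frame rate saves generation steps.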

Supported Languages

The model supports approximately 100 languages (87 language tags listed in the upstream metadata).

Licensing

This model is released under the Boson Higgs Audio v3 Research and Non-Commercial License. Production, hosted, or revenue-generating use requires a separate commercial license from Boson AI.

Additional Sources

The tokenizer was evaluated across DAPS (speech), MUSDB (music), and AudioSet (sound events) with 1,000 clips per category (10 seconds each), plus an Audiophile subset of 150 clips (30 seconds each) from 11 high-fidelity test discs. Metrics included acoustic reconstruction error, semantic integrity (SeedTTS subsets), and Meta Audiobox Aesthetics scores.
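For a sense of scale, the evaluation set described above can be tallied with a few lines of Python. This is only arithmetic on the figures quoted in this listing; it assumes "1,000 clips per category" applies to each of DAPS, MUSDB, and AudioSet, and it reuses the 25 frames-per-second tokenizer rate.

```python
# (clip count, seconds per clip) for each evaluation subset quoted above
EVAL_SETS = {
    "DAPS (speech)": (1000, 10),
    "MUSDB (music)": (1000, 10),
    "AudioSet (sound events)": (1000, 10),
    "Audiophile": (150, 30),
}
FRAME_RATE_HZ = 25  # tokenizer frame rate from the listing

total_clips = sum(count for count, _ in EVAL_SETS.values())
total_seconds = sum(count * seconds for count, seconds in EVAL_SETS.values())
total_hours = total_seconds / 3600
total_frames = total_seconds * FRAME_RATE_HZ

print(f"{total_clips} clips, {total_seconds} s ({total_hours:.2f} h), {total_frames} frames")
```

That comes to 3,150 clips and 34,500 seconds of audio, a little under ten hours.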

The model is also available via the Boson AI API (api.boson.ai/v1/audio/speech) with an OpenAI-compatible interface, preset voices, and streaming support.
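As a rough illustration of what an OpenAI-compatible speech request could look like, the sketch below builds (but does not send) a POST request with Python's standard library. The endpoint path comes from this listing; the https scheme, the model ID, the voice name, and the body fields (borrowed from the OpenAI speech API convention of model, input, voice, and response_format) are assumptions to check against Boson AI's own documentation.

```python
import json
import urllib.request

# Path from the listing; the https scheme is an assumption.
ENDPOINT = "https://api.boson.ai/v1/audio/speech"


def build_speech_request(
    text: str,
    api_key: str,
    model: str = "higgs-audio-v3-tts",  # hypothetical model ID
    voice: str = "default",             # hypothetical preset voice name
) -> urllib.request.Request:
    """Build an OpenAI-style text-to-speech request without sending it."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "wav",  # assumed to follow the OpenAI convention
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_speech_request("Hello from Higgs Audio.", api_key="YOUR_API_KEY")
```

Passing `req` to `urllib.request.urlopen` would send it; the response body would then be the audio bytes, assuming the service follows the OpenAI convention here as well.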

best for

· Zero-shot TTS from text prompts
· Voice cloning using a reference audio clip
· Multilingual speech generation (~100 languages)

FAQ

What is the output format of this model?

It returns a mono 24 kHz waveform as a CPU float32 tensor.
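A waveform in that format can be written to a WAV file with the standard library. The sketch below is one possible way to do it, not part of the model's API: it uses a NumPy array as a stand-in for the returned tensor (a CPU PyTorch tensor can be converted with `.numpy()`), and it assumes samples lie in the range [-1, 1], which the listing does not state.

```python
import io
import wave

import numpy as np

SAMPLE_RATE_HZ = 24_000  # output sample rate from the listing


def float32_to_wav_bytes(waveform, sample_rate: int = SAMPLE_RATE_HZ) -> bytes:
    """Encode a mono float32 waveform as 16-bit PCM WAV bytes."""
    # Assumption: samples are in [-1, 1]; anything outside is clipped.
    clipped = np.clip(np.asarray(waveform, dtype=np.float32), -1.0, 1.0)
    pcm = (clipped * 32767.0).astype("<i2")  # little-endian int16
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample (16-bit)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
    return buffer.getvalue()


# Stand-in for model output: one second of a 440 Hz tone.
t = np.arange(SAMPLE_RATE_HZ, dtype=np.float32) / SAMPLE_RATE_HZ
tone = (0.5 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)
wav_bytes = float32_to_wav_bytes(tone)
```

Writing `wav_bytes` to a file opened in binary mode produces a playable one-second clip.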

Does it support voice cloning?

Yes, you can clone a voice by providing a reference audio clip and optional transcript.

How many languages does it support?

It supports approximately 100 languages; the upstream model metadata lists 87 language tags.

What is the license for commercial use?

The model uses a research/non-commercial license; production or revenue-generating use requires a separate commercial license from Boson AI.

How can I call this model via API?

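The hosted gigarouter endpoint for this model is not yet live. Once it launches, call it through the gigarouter OpenAI-compatible endpoint with an API key. Until then, the Boson AI API (api.boson.ai/v1/audio/speech) also exposes an OpenAI-compatible interface.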

not yet live

We're benchmarking and onboarding Higgs Audio v3 TTS 4B as a hosted, OpenAI-compatible API.

related text-to-speech models

XTTS-v2

9.3M dl/mo

Qwen3-TTS-12Hz-1.7B-CustomVoice

2M dl/mo

Qwen3-TTS-12Hz-0.6B-CustomVoice