
Voxtral 4B TTS 2603

mistralai/Voxtral-4B-TTS-2603

published Nov 2025 · updated Mar 2026

Voxtral 4B TTS 2603 is a fast, open-weights text-to-speech model that produces lifelike speech for voice agents.

status

coming soon

API providers

downloads / mo

74.5K

license

cc-by-nc-4.0

specs

Task	Text-to-Speech (TTS)
Architecture	Hybrid auto-regressive + flow-matching with Voxtral Codec
Parameters	4B
License	CC BY-NC 4.0
Languages	9 (English, French, Spanish, German, Italian, Portuguese, Dutch, Arabic, Hindi)
Audio Output	24 kHz WAV, PCM, FLAC, MP3, AAC, Opus

about this model

Voxtral-4B-TTS-2603 is a text-to-speech model that generates lifelike, expressive speech with low latency and multilingual support across 9 languages. It is designed for production voice agent workflows. gigarouter is onboarding it as a managed, OpenAI-compatible API; the hosted endpoint is not yet live.

Capabilities

The model produces natural prosody and emotional range in English, French, Spanish, German, Italian, Portuguese, Dutch, Arabic, and Hindi. It includes 20 preset voices and supports rapid voice adaptation. Audio output is 24 kHz in WAV, PCM, FLAC, MP3, AAC, or Opus formats, with streaming and batch inference support.

Architecture and Efficiency

Voxtral combines auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens; its speech tokenizer is trained with hybrid VQ-FSQ quantization. On a single NVIDIA H200 at concurrency 1, it achieves a time-to-first-audio of 70 ms, a real-time factor (RTF, lower is better) of 0.103, and a throughput of 119.14 characters per second per GPU. The table below shows how these figures change with concurrency.

Concurrency	Latency (time to first audio)	RTF	Throughput (char/s/GPU)
1	70 ms	0.103	119.14
16	331 ms	0.237	879.11
32	552 ms	0.302	1430.78
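
The benchmark figures can be read with a little arithmetic. The following is a minimal sketch, assuming RTF follows the common definition of synthesis time divided by the duration of the audio produced (this page only states that lower is better). The function names are illustrative and not part of any API.

```python
# Figures copied from the benchmark table (single NVIDIA H200).
BENCHMARKS = {
    1: {"latency_ms": 70, "rtf": 0.103, "chars_per_s": 119.14},
    16: {"latency_ms": 331, "rtf": 0.237, "chars_per_s": 879.11},
    32: {"latency_ms": 552, "rtf": 0.302, "chars_per_s": 1430.78},
}


def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated compute time needed to generate `audio_seconds` of speech.

    ASSUMPTION: RTF = synthesis time / audio duration.
    """
    return audio_seconds * rtf


def throughput_scaling(concurrency: int) -> float:
    """Aggregate throughput relative to the concurrency-1 baseline."""
    return BENCHMARKS[concurrency]["chars_per_s"] / BENCHMARKS[1]["chars_per_s"]


if __name__ == "__main__":
    for concurrency, row in BENCHMARKS.items():
        seconds = synthesis_seconds(60, row["rtf"])
        scaling = throughput_scaling(concurrency)
        print(f"concurrency {concurrency}: 60 s of audio in ~{seconds:.1f} s, "
              f"throughput x{scaling:.2f}")
```

Under that assumption, one minute of audio takes roughly 6 s of compute at concurrency 1. Aggregate throughput grows about 7.4x at concurrency 16 and about 12x at concurrency 32, so scaling is sublinear while time to first audio rises.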

Benchmark Comparison

In human evaluations by native speakers, Voxtral achieved a 68.4% win rate over ElevenLabs Flash v2.5 for multilingual voice cloning naturalness and expressivity (source: research paper).

Pricing starts at $0.016 per 1k characters.
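
As a rough budgeting aid, the starting price can be turned into a cost estimate. This is a sketch only: it assumes pro rata billing per input character at the quoted starting rate, and the page does not state rounding rules, minimum charges, or how characters are counted. The function name is hypothetical.

```python
from decimal import Decimal

# Starting price quoted on this page (USD per 1,000 characters).
PRICE_PER_1K_CHARS_USD = Decimal("0.016")


def estimate_cost_usd(text_chars: int) -> Decimal:
    """Estimate the cost of synthesizing `text_chars` input characters.

    ASSUMPTION: billing is strictly proportional to character count.
    """
    if text_chars < 0:
        raise ValueError("text_chars must be non-negative")
    return PRICE_PER_1K_CHARS_USD * Decimal(text_chars) / Decimal(1000)


if __name__ == "__main__":
    for chars in (1_000, 250_000, 1_000_000):
        print(f"{chars:>9} chars: ${estimate_cost_usd(chars):.2f}")
```

`Decimal` is used instead of `float` so that small per-character amounts do not accumulate binary rounding error.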

Once the model is live, gigarouter will provide it as a hosted API, eliminating infrastructure overhead and offering direct access via standard OpenAI-compatible endpoints.

best for

· Customer support and call center voice agents
· Real-time multilingual translation
· In-vehicle voice assistants
· Sales and marketing audio content

FAQ

What is the latency of Voxtral 4B TTS 2603?

On a single NVIDIA H200, time to first audio is 70 ms at concurrency 1, 331 ms at concurrency 16, and 552 ms at concurrency 32.

What audio formats are supported?

WAV, PCM, FLAC, MP3, AAC, and Opus at 24 kHz.

What is the license?

CC BY-NC 4.0 (non-commercial).

How many languages does it support?

9 languages: English, French, Spanish, German, Italian, Portuguese, Dutch, Arabic, Hindi.

How do I call it via the gigarouter API?

Use the OpenAI-compatible endpoint with an API key. Send a POST to /v1/audio/speech with parameters: input, model, response_format, and voice.
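
A request along those lines might look like the sketch below, which uses only the Python standard library. Several details are assumptions, since the endpoint is not yet live and this page does not document them: the base URL is a placeholder, the model identifier is taken from the repository name above, the voice name is made up, and Bearer-token authentication follows the usual OpenAI convention.

```python
import json
import urllib.request

# ASSUMPTION: placeholder host; this page does not give the real base URL.
BASE_URL = "https://api.example.com"
# ASSUMPTION: the API model id matches the repository name shown on this page.
MODEL_ID = "mistralai/Voxtral-4B-TTS-2603"


def build_speech_request(text, voice, api_key, response_format="mp3"):
    """Build (but do not send) a POST request to /v1/audio/speech."""
    payload = {
        "model": MODEL_ID,
        "input": text,
        "voice": voice,  # hypothetical value; preset voice names are not listed here
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # ASSUMPTION: Bearer auth, as on other OpenAI-compatible APIs.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text, voice, out_path, api_key, response_format="mp3"):
    """Send the request and write the returned audio bytes to `out_path`."""
    request = build_speech_request(text, voice, api_key, response_format)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

Calling `synthesize("Hello there.", "example_voice", "hello.mp3", api_key)` would write the audio to `hello.mp3`, provided the placeholder values are replaced with real ones once the endpoint is available.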

not yet live

We're benchmarking and onboarding Voxtral 4B TTS 2603 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related text-to-speech models


XTTS-v2

9.3M dl/mo

Qwen3-TTS-12Hz-1.7B-CustomVoice

2M dl/mo

Qwen3-TTS-12Hz-0.6B-CustomVoice