Voxtral 4B TTS 2603
mistralai/Voxtral-4B-TTS-2603
published Nov 2025 · updated Mar 2026
Voxtral 4B TTS 2603 is a fast, open-weights text-to-speech model that produces lifelike speech for voice agents.
specs
| Spec | Value |
|---|---|
| Task | Text-to-Speech (TTS) |
| Architecture | Hybrid auto-regressive + flow-matching with Voxtral Codec |
| Parameters | 4B |
| License | CC BY-NC 4.0 |
| Languages | 9 (English, French, Spanish, German, Italian, Portuguese, Dutch, Arabic, Hindi) |
| Audio Output | 24 kHz; WAV, PCM, FLAC, MP3, AAC, Opus |
about this model
Capabilities
The model produces natural prosody and emotional range in English, French, Spanish, German, Italian, Portuguese, Dutch, Arabic, and Hindi. It includes 20 preset voices and supports rapid voice adaptation. Audio output is 24 kHz in WAV, PCM, FLAC, MP3, AAC, or Opus formats, with streaming and batch inference support.
Architecture and Efficiency
Voxtral uses a hybrid architecture combining auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens, using a speech tokenizer trained with hybrid VQ-FSQ quantization. On a single NVIDIA H200, it achieves a time-to-first-audio of 70 ms at concurrency 1, with a real-time factor (RTF, lower is better) of 0.103 and throughput of 119.14 characters per second per GPU.
| Concurrency | Latency (time-to-first-audio) | RTF | Throughput (char/s/GPU) |
|---|---|---|---|
| 1 | 70 ms | 0.103 | 119.14 |
| 16 | 331 ms | 0.237 | 879.11 |
| 32 | 552 ms | 0.302 | 1430.78 |
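To make the table concrete, here is a minimal sketch of the arithmetic behind it. It assumes the usual definition of RTF (compute time divided by the duration of the audio produced), which this page does not spell out, and the helper names are illustrative only.

```python
# Benchmark rows copied from the table above (single NVIDIA H200).
BENCHMARKS = {
    1: {"latency_ms": 70, "rtf": 0.103, "chars_per_s": 119.14},
    16: {"latency_ms": 331, "rtf": 0.237, "chars_per_s": 879.11},
    32: {"latency_ms": 552, "rtf": 0.302, "chars_per_s": 1430.78},
}


def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Compute time for `audio_seconds` of speech.

    Assumption: RTF = compute time / audio duration (not stated on this page).
    """
    return audio_seconds * rtf


def throughput_scaling(concurrency: int) -> float:
    """Throughput at `concurrency` relative to the concurrency-1 row."""
    return BENCHMARKS[concurrency]["chars_per_s"] / BENCHMARKS[1]["chars_per_s"]


for concurrency, row in BENCHMARKS.items():
    seconds = synthesis_seconds(10.0, row["rtf"])
    print(
        f"concurrency {concurrency}: 10 s of audio in about {seconds:.2f} s, "
        f"throughput x{throughput_scaling(concurrency):.1f} vs. concurrency 1"
    )
```

Under that reading, 10 seconds of speech takes roughly 1 second of compute at concurrency 1. Higher concurrency trades per-request latency for total throughput.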
Benchmark Comparison
In human evaluations by native speakers, Voxtral achieved a 68.4% win rate over ElevenLabs Flash v2.5 for multilingual voice cloning naturalness and expressivity (source: research paper). Pricing starts at $0.016 per 1k characters.
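As a rough guide, the starting price can be turned into a per-request estimate. The sketch below assumes billing is linear in character count and that every character (spaces and punctuation included) is billed. Neither is confirmed here, and the function name is hypothetical.

```python
# Starting price quoted above; actual rates may be higher.
PRICE_PER_1K_CHARS_USD = 0.016


def estimate_cost_usd(num_chars: int, price_per_1k: float = PRICE_PER_1K_CHARS_USD) -> float:
    """Estimate cost, assuming linear per-character billing (an assumption)."""
    return num_chars / 1000 * price_per_1k


script = "Thanks for calling. How can I help you today?"
print(f"{len(script)} characters -> about ${estimate_cost_usd(len(script)):.6f}")
```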
gigarouter provides this model as a hosted API, eliminating infrastructure overhead and offering direct access via standard OpenAI-compatible endpoints.
best for
- Customer support and call center voice agents
- Real-time multilingual translation
- In-vehicle voice assistants
- Sales and marketing audio content
FAQ
Latency is 70 ms at concurrency 1 and 331 ms at concurrency 16, measured on a single NVIDIA H200.
Supported output formats are WAV, PCM, FLAC, MP3, AAC, and Opus, all at 24 kHz.
The license is CC BY-NC 4.0 (non-commercial).
The model supports 9 languages: English, French, Spanish, German, Italian, Portuguese, Dutch, Arabic, and Hindi.
To call the model, use the OpenAI-compatible endpoint with an API key: send a POST request to /v1/audio/speech with the parameters input, model, response_format, and voice.
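A request along those lines might look like the sketch below, which uses only the Python standard library. Several details are assumptions rather than documented facts: the base URL is a placeholder, the voice name is hypothetical, the model ID is taken from the repository name on this page, and the Bearer authorization header follows the usual OpenAI-compatible convention.

```python
import json
import urllib.request


def build_speech_request(
    base_url: str,
    api_key: str,
    text: str,
    voice: str,
    model: str = "mistralai/Voxtral-4B-TTS-2603",  # assumed to match the repo ID
    response_format: str = "mp3",
) -> urllib.request.Request:
    """Build a POST request for /v1/audio/speech with the four listed parameters."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Bearer auth is the common OpenAI-compatible convention (assumption).
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(request: urllib.request.Request, out_path: str) -> None:
    """Send the request and write the returned audio bytes to `out_path`."""
    with urllib.request.urlopen(request, timeout=60) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)


# Example usage (placeholder URL, key, and voice name; replace with real values):
# req = build_speech_request("https://api.example.com", "YOUR_API_KEY",
#                            "Hello, how can I help?", voice="example-voice")
# synthesize_to_file(req, "hello.mp3")
```

Building the request separately from sending it makes it easy to inspect the payload before any network call is made.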
We're benchmarking and onboarding Voxtral 4B TTS 2603 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.