Voxtral Mini 3B

mistralai/Voxtral-Mini-3B-2507

published Jul 2025 · updated Jul 2025

Voxtral Mini 3B is a multimodal audio chat model that excels at speech transcription, translation, and audio understanding while retaining strong text capabilities.

status

coming soon

API providers

downloads / mo

310.2K

license

apache-2.0

specs

Task	Speech Understanding & Transcription
Architecture	Whisper large-v3 encoder + Ministral 3B decoder
Parameters	3B
License	Apache 2.0
Context Length	32k tokens
GPU Memory (bf16)	~9.5 GB

about this model

Voxtral-Mini-3B-2507 is a multimodal audio chat model that combines speech transcription, translation, and understanding with text capabilities, built on the Ministral-3B language model backbone.

Key Features

Dedicated transcription mode: Automatically detects source language and transcribes speech with high accuracy; uses temperature=0.0 for optimal results.
Long-form context: Supports up to 32k tokens, handling audio of up to 30 minutes for transcription or 40 minutes for understanding tasks.
Built-in Q&A and summarization: Processes spoken questions and generates structured summaries without requiring a separate ASR pipeline.
Multilingual: State-of-the-art performance in English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian, with automatic language detection.
Function calling from voice: Enables direct triggering of backend functions, workflows, or API calls based on spoken intents.
Text capability preserved: Retains the full text understanding performance of Ministral-3B.

Benchmark Results

Audio: Average word error rate (WER) across the FLEURS, Mozilla Common Voice, and Multilingual LibriSpeech benchmarks:

Bar chart showing WER for Voxtral Mini vs. baseline models on FLEURS, Common Voice, and Multilingual LibriSpeech

Text: Retained text benchmark performance (e.g., MMLU, HellaSwag) compared to Ministral-3B:

Bar chart comparing text benchmark scores between Voxtral Mini and Ministral-3B

Architecture and Performance

Audio encoder based on Whisper large-v3 with 50 Hz frame rate; adapter layer downsamples audio embeddings by 4×; processes 30-second audio chunks independently (log-Mel spectrograms with 128 bins, 160 hop-length).
Three new speech evaluation benchmarks introduced: MMLU Speech, TriviaQA Speech, and GSM8K Speech.
Comparison models: Voxtral Mini is benchmarked against GPT-4o mini and Gemini 2.5 Flash for speech QA and summarization.
GPU memory requirement: approximately 9.5 GB in bf16 or fp16 precision.
Released under Apache 2.0 license (see research paper for full details).

best for

·Transcribing long audio files (up to 30 minutes) with automatic language detection
·Multilingual audio understanding, summarization, and Q&A without separate ASR
·Voice-enabled function calling and workflow automation from spoken intents
·Comparing and analyzing multiple audio clips in a single conversation

FAQ

What tasks is Voxtral Mini 3B best for?

It is designed for speech transcription, translation, audio understanding, summarization, and voice-driven function calling, all while retaining strong text capabilities.

What are the input and output formats?

Input: audio files (e.g., MP3) and text. Output: transcribed text or chat responses. It supports multiple audios per message and multi-turn conversations.

How much GPU memory is needed to run this model?

Running Voxtral Mini 3B requires approximately 9.5 GB of GPU RAM in bf16 or fp16 precision.

What license is Voxtral Mini 3B released under?

It is released under the Apache 2.0 license.

How can I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint provided by gigarouter with your API key. Set the model name to mistralai/Voxtral-Mini-3B-2507 and send requests for chat completions or audio transcriptions.

not yet live

We're benchmarking and onboarding Voxtral Mini 3B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related specialist model models

compare all →

electra-base-discriminator

wespeaker-voxceleb-resnet34-LM

6.8M dl/mo

unidepth-v2-vitl14

6.3M dl/mo