Kyutai STT 2.6B English

kyutai/stt-2.6b-en

published Jun 2025 · updated Jun 2025

Kyutai STT 2.6B English is a streaming speech-to-text model that transcribes English audio with a 2.5-second delay.

est. price

~$0.0034

· estimated, set at launch

license

cc-by-4.0

specs

Task	Automatic Speech Recognition (ASR) / Streaming Speech-to-Text
Architecture	Decoder-only Transformer with Mimi audio tokenizer
Parameters	~2.6 billion
License	CC-BY 4.0

about this model

kyutai/stt-2.6b-en is a streaming automatic speech recognition (ASR) model that transcribes English audio into text with punctuation and capitalization, emitting text about 2.5 seconds after the corresponding audio arrives.

Key Capabilities

Streaming inference: processes audio in chunks for real-time transcription, suitable for interactive applications.
Word-level timestamps: returns a timestamp for each transcribed word.
Robustness: handles noisy conditions and performs reliably on audio up to 2 hours long without additional adaptation.
Architecture: a decoder-only Transformer that consumes audio tokenized by Mimi (12.5 Hz frame rate, 32 audio tokens per frame) and outputs text tokens. The text stream is shifted by a 2.5-second delay relative to the audio stream.
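
The frame figures above imply some simple stream arithmetic. The following is a minimal sketch that uses only the numbers quoted on this page (12.5 Hz frame rate, 32 tokens per frame, 2.5-second delay); the constant and function names are illustrative and not part of any Kyutai API.

```python
# Figures quoted on this page; the names below are illustrative only.
FRAME_RATE_HZ = 12.5     # Mimi frames per second of audio
TOKENS_PER_FRAME = 32    # audio tokens per Mimi frame
TEXT_DELAY_S = 2.5       # text stream lags the audio stream by this much


def frame_duration_ms() -> float:
    """Duration of audio covered by one Mimi frame, in milliseconds."""
    return 1000.0 / FRAME_RATE_HZ


def audio_tokens_per_second() -> float:
    """Audio tokens the model consumes per second of input audio."""
    return FRAME_RATE_HZ * TOKENS_PER_FRAME


def delay_in_frames() -> float:
    """The text delay expressed in Mimi frames."""
    return TEXT_DELAY_S * FRAME_RATE_HZ


def frames_for(duration_s: float) -> int:
    """Number of whole Mimi frames in `duration_s` seconds of audio."""
    return int(duration_s * FRAME_RATE_HZ)


if __name__ == "__main__":
    print(frame_duration_ms())        # 80.0
    print(audio_tokens_per_second())  # 400.0
    print(delay_in_frames())          # 31.25
    print(frames_for(2 * 60 * 60))    # 90000
```

Each frame therefore covers 80 ms of audio, and the 2.5-second delay corresponds to a little over 31 frames.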

Performance

On a single H100 GPU, the model can batch-process 400 audio streams in real time.
A single L40S GPU serves 64 simultaneous streaming connections via a Rust websocket server at a 3x real-time factor.
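
To put the two serving figures on a common scale, one can convert each into seconds of audio processed per second of wall-clock time. This sketch assumes that "3x real-time factor" means each connection is transcribed three times faster than real time; that reading is an assumption, and the helper name is hypothetical.

```python
def audio_seconds_per_wall_second(streams: int, speedup: float = 1.0) -> float:
    """Aggregate audio throughput: concurrent streams times per-stream speedup.

    `speedup` is how many seconds of audio each stream handles per wall-clock
    second (1.0 means exactly real time). Reading the page's "3x real-time
    factor" as speedup=3.0 is an assumption.
    """
    if streams < 0 or speedup <= 0:
        raise ValueError("streams must be >= 0 and speedup must be > 0")
    return streams * speedup


h100 = audio_seconds_per_wall_second(400)       # 400 streams in real time
l40s = audio_seconds_per_wall_second(64, 3.0)   # 64 connections at 3x

print(h100)  # 400.0
print(l40s)  # 192.0
```

The two figures come from different setups (batched processing versus a websocket server), so the comparison is only indicative.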

Training Details

The model was pretrained on 2.5 million hours of publicly available audio, with synthetic transcripts generated by whisper-timestamped. It was then fine-tuned on 24,000 hours of public datasets with ground-truth transcripts. A final long-form fine-tuning stage used concatenated LibriSpeech examples and synthesized dialogs (23,000 hours in total).

Additional Features

Transcripts include capitalization and punctuation. Word-level timestamps can be derived by subtracting the 2.5-second text-stream delay from the audio frame offset. The model is English-only (language identifier en) and is released under CC-BY 4.0. The safetensors metadata reports approximately 2.7 billion parameters, slightly above the ~2.6 billion given in the model name and the spec table.
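
The timestamp derivation described above can be sketched as follows. It assumes that a word's text token appears at a known frame index of the stream and that frames run at the 12.5 Hz Mimi rate; the function name and the `(word, frame_index)` input shape are hypothetical and do not reflect the model's actual output format.

```python
FRAME_RATE_HZ = 12.5   # Mimi frame rate quoted on this page
TEXT_DELAY_S = 2.5     # delay of the text stream relative to the audio


def word_start_seconds(frame_index: int) -> float:
    """Estimate where a word starts in the audio.

    `frame_index` is the stream step at which the word's text token appeared.
    Converting it to seconds gives the audio frame offset; subtracting the
    text-stream delay maps it back onto the audio timeline.
    """
    audio_offset_s = frame_index / FRAME_RATE_HZ
    # Clamp at zero so steps inside the initial delay never go negative
    # (a choice made for this sketch, not something the page specifies).
    return max(0.0, audio_offset_s - TEXT_DELAY_S)


# Hypothetical (word, frame_index) pairs, for illustration only.
emitted = [("Hello", 50), ("world", 75)]
timestamps = [(word, word_start_seconds(idx)) for word, idx in emitted]
print(timestamps)  # [('Hello', 1.5), ('world', 3.5)]
```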

best for

·Real-time transcription of live audio streams or voice calls
·Building voice agents that require word-level timestamps and low latency

FAQ

What is the streaming latency of this model?

The model introduces a 2.5-second delay between audio input and text output.

What license covers the model weights?

The model weights are licensed under CC-BY 4.0.

Does the model return word-level timestamps?

Yes, it returns word-level timestamps along with the transcript.

How do I call this model via the gigarouter API?

Once the model is live, use the OpenAI-compatible endpoint with your gigarouter API key and specify the model ID kyutai/stt-2.6b-en.
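
Because the model is not yet hosted, the request shape can only be sketched. The example below builds (but does not send) a multipart POST in the style of OpenAI's `/audio/transcriptions` endpoint, using only the Python standard library. The base URL `https://api.gigarouter.example/v1`, the assumption that gigarouter will expose this route, and the helper name are all hypothetical; check the provider's documentation once the model is live.

```python
import urllib.request
import uuid

# Hypothetical base URL: a placeholder, not a real gigarouter address.
BASE_URL = "https://api.gigarouter.example/v1"
MODEL_ID = "kyutai/stt-2.6b-en"


def build_transcription_request(
    audio_bytes: bytes, filename: str, api_key: str
) -> urllib.request.Request:
    """Build a multipart/form-data POST with `model` and `file` fields.

    The field names follow OpenAI's transcription endpoint; whether
    gigarouter accepts exactly this shape is an assumption.
    """
    boundary = uuid.uuid4().hex
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{MODEL_ID}\r\n"
    ).encode()
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    closing = f"\r\n--{boundary}--\r\n".encode()
    body = model_part + file_header + audio_bytes + closing

    return urllib.request.Request(
        url=f"{BASE_URL}/audio/transcriptions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Placeholder bytes and key; nothing is sent over the network here.
request = build_transcription_request(b"RIFF....", "sample.wav", "YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

Passing the built request to `urllib.request.urlopen` would send it; that step is left out because the endpoint above is a placeholder.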

What is the input format for audio?

The model accepts audio tokenized by the Mimi codec; for API usage, raw audio is processed into the required format by gigarouter.

not yet live

We're benchmarking and onboarding Kyutai STT 2.6B English as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related speech-to-text models

speaker-diarization-3.1

wav2vec2-large-xlsr-53-japanese

6.1M dl/mo

wav2vec2-large-xlsr-53-polish

4.7M dl/mo

wav2vec2-large-xlsr-53-dutch

4.1M dl/mo