Whisper Medium English

openai/whisper-medium.en

published Sep 2022 · updated Jan 2024

Whisper Medium English is an automatic speech recognition model that transcribes English audio into text using a Transformer encoder-decoder architecture trained on 680k hours of weakly supervised data.

est. price

~$0.0034

· estimated, set at launch

downloads / mo

50.1K

license

apache-2.0

specs

Task	Automatic Speech Recognition (ASR)
Architecture	Transformer encoder-decoder (sequence-to-sequence)
Parameters	769 million
License	MIT

about this model

Whisper medium.en is an automatic speech recognition (ASR) model that transcribes English audio into text. It is a Transformer-based encoder-decoder (sequence-to-sequence) model from the Whisper family, which was trained on 680,000 hours of weakly supervised labelled audio; the .en checkpoints are trained on English-only data. The model is designed for zero-shot generalization: it performs well across diverse domains, accents, and recording conditions without fine-tuning.

Architecture and Capabilities

Whisper medium.en contains 769 million parameters and is optimized for English speech recognition. It processes audio in 30-second segments and supports long-form transcription via chunking, enabling arbitrary-length audio processing. The model can also predict word-level timestamps.
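The chunking step described above can be sketched as a small helper that splits a long recording into overlapping 30-second windows. This is a minimal illustration, not the model's own implementation: the function name chunk_bounds and the 5-second overlap are assumptions chosen for the example, and real pipelines also need a strategy for merging the text where chunks overlap.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000  # Hz, the input rate the model expects
WINDOW_S = 30.0       # length of one model input segment, in seconds
OVERLAP_S = 5.0       # hypothetical overlap between neighbouring chunks


def chunk_bounds(total_samples: int,
                 window_s: float = WINDOW_S,
                 overlap_s: float = OVERLAP_S,
                 sample_rate: int = SAMPLE_RATE) -> List[Tuple[int, int]]:
    """Return (start, end) sample offsets that cover the whole recording."""
    if not 0 <= overlap_s < window_s:
        raise ValueError("overlap must be non-negative and shorter than the window")
    window = int(window_s * sample_rate)
    step = window - int(overlap_s * sample_rate)
    bounds: List[Tuple[int, int]] = []
    start = 0
    while start < total_samples:
        end = min(start + window, total_samples)
        bounds.append((start, end))
        if end == total_samples:
            break  # the final chunk reached the end of the audio
        start += step
    return bounds


# A 70-second recording yields three chunks, the last one shorter than 30 s.
for start, end in chunk_bounds(70 * SAMPLE_RATE):
    print(f"{start / SAMPLE_RATE:.0f}s to {end / SAMPLE_RATE:.0f}s")
```

Overlapping the windows gives each chunk some context on both sides of a boundary, which is one common way to avoid cutting words in half.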

Benchmark Performance

On LibriSpeech test-clean, the model achieves a Word Error Rate (WER) of 4.12% (official result) and 3.02% under alternative inference settings. Additional benchmark results include:

LibriSpeech test-other: 7.43% WER
AMI: 16.68% WER
Earnings22: 12.63% WER
Gigaspeech: 11.03% WER
Open ASR Leaderboard mean WER: 8.09%
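For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration of that definition; the function name word_error_rate is an assumption for the example, and lowercasing is a simplification of the text normalization that published benchmarks typically apply before scoring.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" to "sit") and one deletion ("the") over six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.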

Inference Characteristics

The model requires approximately 5 GB of VRAM and runs at roughly 2x the inference speed of the large-v2 variant. It supports batched inference and chunked processing for audio of arbitrary length.
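As a rough sanity check on the memory figure, the weights alone can be estimated from the parameter count and the numeric precision. This is back-of-the-envelope arithmetic, not a measurement: the helper name weight_memory_gib is an assumption for the example, and the estimate covers weights only, so activations, decoding state, and framework overhead come on top.

```python
PARAMS = 769_000_000  # parameter count quoted for medium.en

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(params: int, dtype: str) -> float:
    """Memory taken by the weights alone, in GiB (excludes activations)."""
    return params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(PARAMS, dtype):.2f} GiB")
```

The float32 weights come to just under 3 GiB and the float16 weights to roughly half of that, so the quoted ~5 GB leaves headroom beyond the weights themselves.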

Training Data and Robustness

Trained on 680,000 hours of internet-sourced audio, the model demonstrates improved robustness to accents, background noise, and technical language compared to many existing ASR systems. It achieves near-state-of-the-art accuracy in a zero-shot transfer setting without fine-tuning.

best for

· English speech transcription with high accuracy
· Transcribing long audio files of arbitrary length via a chunked pipeline
· Zero-shot ASR on diverse English accents and domains without fine-tuning

FAQ

What is the input format for the Whisper Medium English model?

The model expects audio as a 16 kHz mono waveform, which is pre-processed into a log-Mel spectrogram. The gigarouter API accepts audio file uploads or raw audio bytes.
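Before uploading, it can help to confirm that a file already matches the 16 kHz mono format. The sketch below reads a WAV header with Python's standard wave module; the function name check_wav and the returned dictionary keys are assumptions for the example, and it only handles uncompressed WAV files, not other containers.

```python
import wave

EXPECTED_RATE = 16_000  # Hz
EXPECTED_CHANNELS = 1   # mono


def check_wav(path: str) -> dict:
    """Read the WAV header and report whether it is 16 kHz mono."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {
        "sample_rate": rate,
        "channels": channels,
        "duration_s": frames / rate,
        # True only when no resampling or downmixing would be needed.
        "ready": rate == EXPECTED_RATE and channels == EXPECTED_CHANNELS,
    }
```

A file that reports ready as False would need resampling or downmixing first, whether done locally or by the serving stack.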

What is the output format?

The model outputs transcribed English text as a string. It can also return timestamped segments when requested.

How much VRAM is needed to run Whisper Medium English?

Approximately 5 GB of VRAM is required for inference.

What is the license for Whisper Medium English?

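The Hugging Face model card lists the license as apache-2.0, while OpenAI's Whisper GitHub repository releases the code and model weights under the MIT license.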

How do I call this model via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key, specifying the model as openai/whisper-medium.en and sending the audio file or bytes in the request.
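A request of that shape might be assembled as follows. This is a sketch under stated assumptions rather than documented gigarouter usage: the base URL is a hypothetical placeholder, the /audio/transcriptions path and the "model" and "file" form fields follow the OpenAI transcription API and are assumed to carry over, and the helper name build_transcription_request is invented for the example. The function only builds the request; it does not send it.

```python
import uuid
import urllib.request

# Hypothetical placeholder: the real gigarouter base URL is not given here.
BASE_URL = "https://api.example.invalid/v1"
MODEL_ID = "openai/whisper-medium.en"


def build_transcription_request(audio: bytes, filename: str, api_key: str,
                                base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a multipart POST carrying the audio bytes."""
    boundary = uuid.uuid4().hex
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{MODEL_ID}\r\n"
    ).encode()
    file_part = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + audio + b"\r\n"
    closing = f"--{boundary}--\r\n".encode()
    return urllib.request.Request(
        url=f"{base_url}/audio/transcriptions",  # assumed OpenAI-style path
        data=model_part + file_part + closing,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to urllib.request.urlopen would send it; with an OpenAI-style endpoint the response body would be expected to be JSON containing the transcribed text, though the exact response schema should be checked against the provider's documentation.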

not yet live

We're benchmarking and onboarding Whisper Medium English as a hosted, OpenAI-compatible API; it is not yet available to call.

related speech-to-text models

speaker-diarization-3.1

wav2vec2-large-xlsr-53-japanese

6.1M dl/mo

wav2vec2-large-xlsr-53-polish

4.7M dl/mo

wav2vec2-large-xlsr-53-dutch

4.1M dl/mo