Whisper Base.en

openai/whisper-base.en

published Sep 2022 · updated Jan 2024

Whisper Base.en is an automatic speech recognition (ASR) model that transcribes English audio to text using a Transformer encoder-decoder architecture trained on 680k hours of weakly supervised data.

est. price

~$0.0034

· estimated, set at launch

API providers

downloads / mo

30.8K

license

apache-2.0

specs

Task	Automatic Speech Recognition (ASR) - English only
Architecture	Transformer encoder-decoder (sequence-to-sequence)
Parameters	74 million
Language	English only

about this model

Whisper Base.en is an automatic speech recognition (ASR) model that transcribes English speech into text. It is a Transformer-based encoder-decoder (sequence-to-sequence) model trained on 680,000 hours of weakly supervised audio data, which lets it generalize to many datasets and domains without fine-tuning. The model is optimized for English-only speech recognition. According to OpenAI, the English-only variants (".en") tend to perform better than the multilingual checkpoints for English applications, particularly at the base size.

On the LibriSpeech test-clean benchmark, whisper-base.en achieves a Word Error Rate (WER) of 4.27%. On the more challenging LibriSpeech test-other set, it achieves a WER of 12.80%. On the Open ASR Leaderboard aggregated benchmark, the model reports a mean WER of 10.32% across datasets including AMI (21.13%), Earnings22 (15.09%), and GigaSpeech (12.83%).

Key capabilities include:

· Robustness to accents, background noise, and technical language
· Zero-shot transfer to new domains without fine-tuning
· Long-form transcription of audio of arbitrary length via chunking
· Timestamp prediction for transcribed segments

The model has 74 million parameters and processes audio in windows of up to 30 seconds. For longer recordings, chunking with batched inference is supported. GigaRouter is onboarding the model as a managed API with OpenAI-compatible endpoints for direct integration.
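The WER figures quoted above count word-level substitutions, deletions, and insertions against a reference transcript, divided by the number of reference words. A minimal sketch of that calculation follows. The function name is hypothetical, and it only lowercases and splits on whitespace, whereas published benchmarks usually apply a fuller text normalization first, so it will not reproduce the leaderboard numbers exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace split stands in for real normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Dynamic-programming edit distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A WER of 4.27% therefore corresponds to roughly 4 word errors per 100 reference words.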

best for

· Transcribing English podcasts, interviews, and meetings
· Building real-time English speech-to-text applications
· Generating accurate subtitles for English video content

FAQ

What is Whisper Base.en best used for?

It is best suited to English speech recognition, where it reaches a low word error rate (4.27% on LibriSpeech test-clean) without fine-tuning. Typical uses are transcribing meetings, interviews, and podcasts.

How does Whisper Base.en compare in size and speed to other Whisper models?

With 74M parameters, it is a compact model that needs about 1 GB of VRAM and runs approximately 7x faster than the large model on an A100 GPU.
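As a rough sanity check on the memory figure, the size of the weights alone can be estimated from the parameter count. This is a back-of-the-envelope sketch with a hypothetical helper name, assuming 4 bytes per parameter for float32 and 2 bytes for float16. It does not count activations, decoding caches, or framework overhead, which likely account for much of the gap up to the quoted 1 GB.

```python
def weight_megabytes(num_params: int, bytes_per_param: int) -> float:
    """Approximate size of the model weights in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1_000_000


PARAMS = 74_000_000  # parameter count from the specs above

print(weight_megabytes(PARAMS, 4))  # float32 weights -> 296.0
print(weight_megabytes(PARAMS, 2))  # float16 weights -> 148.0
```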

What input format does the model expect?

The model expects audio as a 16 kHz mono waveform, typically pre-processed into log-Mel spectrograms via the WhisperProcessor.
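Before sending a file to the model, it can help to check whether it already matches the 16 kHz mono format. The following is a small sketch using only Python's standard `wave` module, so it applies to uncompressed WAV files only. The helper name and the returned dictionary keys are illustrative, and the actual resampling or downmixing is left to an audio library of your choice.

```python
import wave

TARGET_RATE = 16_000  # sample rate the model expects, in Hz


def describe_wav(path: str) -> dict:
    """Read a WAV header and report whether the audio needs conversion."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {
        "sample_rate": rate,
        "channels": channels,
        "duration_s": frames / rate,
        "needs_resample": rate != TARGET_RATE,  # not 16 kHz
        "needs_downmix": channels != 1,         # not mono
    }
```

A 44.1 kHz stereo recording, for example, would come back with both `needs_resample` and `needs_downmix` set to `True`.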

How can I transcribe audio longer than 30 seconds?

Use the Transformers pipeline with chunk_length_s=30 to enable chunking; timestamps can be returned with return_timestamps=True.
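The chunked pipeline call described above might look like the sketch below. It assumes the `transformers` package with a PyTorch backend is installed and that ffmpeg is available for decoding audio files passed by path. The helper names and the `batch_size=8` default are illustrative choices, not values taken from this page.

```python
MODEL_ID = "openai/whisper-base.en"
CHUNK_LENGTH_S = 30  # chunk size in seconds, matching the model's window


def transcribe_long(audio_path: str, batch_size: int = 8) -> list:
    """Transcribe a long recording and return timestamped chunks."""
    # Imported lazily so the formatting helper below works without transformers.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=MODEL_ID,
        chunk_length_s=CHUNK_LENGTH_S,
    )
    result = asr(audio_path, batch_size=batch_size, return_timestamps=True)
    return result["chunks"]  # list of {"timestamp": (start, end), "text": ...}


def format_chunks(chunks: list) -> list:
    """Render timestamped chunks as '[start - end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        # The final chunk's end time can be missing; show a placeholder then.
        end_label = f"{end:.2f}" if end is not None else "?"
        lines.append(f"[{start:.2f} - {end_label}] {chunk['text'].strip()}")
    return lines
```

Calling `format_chunks(transcribe_long("meeting.wav"))` would then yield one line per segment, which is a convenient starting point for subtitle generation.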

How do I call this model via the GigaRouter API?

Use the GigaRouter OpenAI-compatible endpoint with your API key, sending the audio file and specifying the model as "whisper-base.en".
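A call through an OpenAI-compatible endpoint could be sketched as follows, using the `openai` Python package. The environment variable names `GIGAROUTER_BASE_URL` and `GIGAROUTER_API_KEY` are hypothetical placeholders, since this page does not state the base URL, and the function name is illustrative. Because the listing marks the hosted endpoint as not yet live, treat this as a shape for the request rather than something verified against the service.

```python
import os


def transcribe_via_gigarouter(audio_path: str, model: str = "whisper-base.en") -> str:
    """Send an audio file to an OpenAI-compatible transcription endpoint."""
    # Hypothetical variable names: set them to the values from your account.
    base_url = os.environ["GIGAROUTER_BASE_URL"]
    api_key = os.environ["GIGAROUTER_API_KEY"]

    # Imported lazily so the configuration check above runs first.
    from openai import OpenAI

    client = OpenAI(base_url=base_url, api_key=api_key)
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text
```

Reading the base URL from configuration keeps the same code usable with any provider that implements the OpenAI transcription API.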

not yet live

We're benchmarking and onboarding Whisper Base.en as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related speech-to-text models

speaker-diarization-3.1

wav2vec2-large-xlsr-53-japanese

6.1M dl/mo

wav2vec2-large-xlsr-53-polish

4.7M dl/mo

wav2vec2-large-xlsr-53-dutch

4.1M dl/mo