Whisper Small English
openai/whisper-small.en
published Sep 2022 · updated Jan 2024
Whisper Small English is a Transformer-based automatic speech recognition model for English, trained on 680k hours of weakly supervised data.
est. price: ~$0.0034 (estimated, set at launch)
specs
| Task | Automatic Speech Recognition (ASR) |
| Architecture | Transformer encoder-decoder (sequence-to-sequence) |
| Parameters | 244 million |
| Language | English-only |
about this model
Whisper-small.en is an automatic speech recognition (ASR) model that transcribes English speech into text. It is a Transformer sequence-to-sequence model trained on 680,000 hours of weakly supervised audio data, giving it a strong ability to generalize to new domains without fine-tuning. The model achieves a word error rate (WER) of 3.05% on the LibriSpeech test-clean benchmark, demonstrating high accuracy on standard English ASR tasks. It is designed for single-language English recognition and tends to outperform its multilingual counterpart for English transcription, though the gap is smaller at the small size.
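To make the WER figure concrete: word error rate is the number of word-level substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of reference words, so 3.05% is roughly 3 errors per 100 reference words. The sketch below is a minimal illustration of that formula, not the evaluation code behind the reported number; the function name is hypothetical, and it assumes simple lowercasing and whitespace tokenization, whereas published benchmarks usually apply a fuller text normalizer first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumption: lowercase + whitespace split stands in for real text normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many inserted words.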
Key features include long-form transcription of audio of arbitrary length via chunking and the ability to predict sequence-level timestamps. The small.en variant has 244 million parameters, requires approximately 2 GB of VRAM, and offers roughly 4x faster inference than the large model on an A100 GPU. The model is robust to accents, background noise, and technical language, but it may occasionally hallucinate text or repeat itself. The original study also notes that accuracy varies across demographic groups and, for the multilingual versions, across languages.
The model is hosted as a managed API on gigarouter, eliminating the need for local installation or GPU management. Users simply call an OpenAI-compatible endpoint to transcribe English audio with production-ready performance.
best for
- Transcribing English audio in real time or in batch
- Transcribing long audio recordings (e.g., meetings, podcasts) with chunking
- Integrating speech-to-text into applications or services
FAQ
What is the primary use case for Whisper Small English?
It is designed for English speech recognition and can transcribe audio to text with high accuracy and robustness to accents, background noise, and technical language.
How does Whisper Small English compare in size and speed to larger Whisper models?
It has 244M parameters, requires about 2 GB of VRAM, and runs approximately 4x faster than the Whisper large model on an A100 GPU.
What input format does the model expect?
It expects audio sampled at 16 kHz and converted to log-Mel spectrograms. The model processes audio in windows of up to 30 seconds; longer audio is handled by chunking.
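The chunking step can be sketched as simple index arithmetic over a 16 kHz waveform. This is a minimal illustration under stated assumptions, not the chunking logic of any particular library: the function name `chunk_bounds` and the optional `overlap_seconds` parameter are hypothetical, and real pipelines typically also reconcile the text where neighbouring chunks overlap.

```python
SAMPLE_RATE = 16_000   # Hz, the sampling rate the model expects
CHUNK_SECONDS = 30     # maximum window length the model handles at once


def chunk_bounds(num_samples, sample_rate=SAMPLE_RATE,
                 chunk_seconds=CHUNK_SECONDS, overlap_seconds=0.0):
    """Return (start, end) sample indices covering the whole recording.

    overlap_seconds is a hypothetical knob: consecutive chunks share that
    much audio so words cut at a boundary appear whole in one of them.
    """
    chunk = int(chunk_seconds * sample_rate)
    step = chunk - int(overlap_seconds * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk length")

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:  # the last chunk reached the end of the audio
            break
        start += step
    return bounds


if __name__ == "__main__":
    # A 70-second recording splits into two full windows and a 10-second tail.
    for start, end in chunk_bounds(70 * SAMPLE_RATE):
        print(start / SAMPLE_RATE, end / SAMPLE_RATE)
```

Each `(start, end)` pair can be used to slice the waveform before feature extraction; the final chunk is simply shorter than 30 seconds.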
How can I use this model via the gigarouter API?
Send requests to the gigarouter OpenAI-compatible endpoint with an API key and the audio data in the request body. The response will contain the transcribed text.
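As a rough sketch of what such a request could look like, the example below builds (but does not send) a multipart upload using only the Python standard library. Everything specific here is an assumption: the base URL is a placeholder rather than a real gigarouter address, the `/audio/transcriptions` path and the `model` and `file` form fields follow the OpenAI transcription API convention, and the model identifier is taken from the listing above and may differ on the hosted service.

```python
import urllib.request
import uuid

BASE_URL = "https://api.gigarouter.example/v1"  # hypothetical placeholder, not a real endpoint
MODEL_ID = "openai/whisper-small.en"            # assumed to match the hosted model name


def build_transcription_request(audio_bytes, filename, api_key,
                                base_url=BASE_URL, model=MODEL_ID):
    """Build a POST request in the OpenAI-style multipart transcription format."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Form field naming the model to use.
        (f"--{boundary}\r\n"
         'Content-Disposition: form-data; name="model"\r\n\r\n'
         f"{model}\r\n").encode(),
        # Form field carrying the raw audio file.
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode(),
        audio_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url=f"{base_url}/audio/transcriptions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


if __name__ == "__main__":
    request = build_transcription_request(b"<audio bytes>", "meeting.wav", "YOUR_API_KEY")
    print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` and decoding the JSON response; in the OpenAI format the transcript is returned in a `text` field, which is assumed rather than confirmed for this service.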
Does the model support timestamps in the transcription?
Yes. When running the model through the Hugging Face Transformers `pipeline`, pass `return_timestamps=True` to obtain segment-level timestamps for each transcribed phrase.
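A minimal sketch of that call and of reading its output, assuming a local install of the `transformers` library (plus a backend such as PyTorch, and ffmpeg for decoding audio files). The two helper function names are hypothetical; the result layout, a `chunks` list of dictionaries with `timestamp` and `text` keys, reflects what the Transformers speech recognition pipeline returns when timestamps are requested.

```python
def transcribe_with_timestamps(audio_path):
    """Run the model locally and return text plus segment timestamps."""
    # Imported lazily so the formatting helper below works without transformers installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-small.en",
        chunk_length_s=30,  # enables chunked long-form transcription
    )
    return asr(audio_path, return_timestamps=True)


def format_segments(result):
    """Turn the pipeline result into one '[start -> end] text' line per segment."""
    lines = []
    for chunk in result["chunks"]:
        start, end = chunk["timestamp"]
        # The end time of the final segment can be missing (None).
        end_text = f"{end:.2f}" if end is not None else "?"
        lines.append(f"[{start:.2f} -> {end_text}] {chunk['text'].strip()}")
    return lines


if __name__ == "__main__":
    # Hand-written sample in the pipeline's output shape, so no model download is needed.
    sample = {
        "text": " Hello there. How are you?",
        "chunks": [
            {"timestamp": (0.0, 1.5), "text": " Hello there."},
            {"timestamp": (1.5, 3.2), "text": " How are you?"},
        ],
    }
    print("\n".join(format_segments(sample)))
```

Whether the hosted API exposes timestamps, and through which request parameter, is not stated here, so this sketch covers local use only.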
not yet live
We're benchmarking and onboarding Whisper Small English as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.