Whisper Small
openai/whisper-small
published Sep 2022 · updated Feb 2024
Whisper Small is an automatic speech recognition (ASR) and speech translation model that transcribes and translates multilingual audio using a transformer encoder-decoder architecture.
specs
| Field | Value |
|---|---|
| Task | Automatic Speech Recognition (ASR) and Speech Translation |
| Architecture | Transformer encoder-decoder (sequence-to-sequence) |
| Parameters | 244 million |
| License | Apache-2.0 |
about this model
openai/whisper-small is an automatic speech recognition (ASR) model that also supports speech translation and language identification. It is a Transformer encoder-decoder (sequence-to-sequence) model trained on 680,000 hours of weakly supervised multilingual data, enabling strong zero-shot generalisation across domains and languages without fine-tuning.
Key capabilities
- Multilingual speech recognition: transcribe audio in the same language as the input.
- Speech translation: translate speech in another language into English text (e.g., French audio to English text).
- Language identification and timestamp prediction (optional).
- Long-form transcription via chunking (up to 30‑second windows, arbitrary total length).
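The long-form capability above relies on cutting audio into windows of at most 30 seconds. The sketch below shows one way to compute such windows. It assumes 16 kHz input, and the 5-second overlap is a hypothetical value chosen for illustration; `chunk_bounds` is not a function from the Whisper code base.

```python
SAMPLE_RATE = 16_000  # Hz, assumed input sampling rate
WINDOW_S = 30         # maximum window length in seconds


def chunk_bounds(num_samples, overlap_s=5.0):
    """Return (start, end) sample indices of windows that cover the audio.

    overlap_s is a hypothetical overlap between neighbouring windows.
    """
    window = WINDOW_S * SAMPLE_RATE
    step = window - int(overlap_s * SAMPLE_RATE)
    if step <= 0:
        raise ValueError("overlap must be shorter than the window")

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the audio
        start += step
    return bounds


# A 70-second recording is covered by three overlapping windows.
windows = chunk_bounds(70 * SAMPLE_RATE)
```

Overlapping windows give the text that straddles a boundary a chance to appear whole in at least one window. How the overlapping transcripts are merged is left out of this sketch.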
Performance benchmarks
| Dataset | Word Error Rate (WER) |
|---|---|
| LibriSpeech test-clean | 3.43% |
| LibriSpeech test-other | 7.63% |
| Common Voice 11.0 (Hindi) | 87.3% |
| Common Voice 13.0 (Divehi) | 125.7% |
Benchmark results reflect zero-shot evaluation; the model was not fine-tuned on any of these datasets.
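A WER above 100%, as in the Divehi row, is possible because the error count is divided by the number of reference words, and a hypothesis with many inserted words can contain more errors than the reference has words. The sketch below uses a common definition of WER (word-level edit distance over reference length). The text normalization applied in the benchmarks above is not stated here, so this is illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Three inserted words against a two-word reference give a WER of 150%.
print(word_error_rate("hi there", "hi hi there you all"))  # 1.5
```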
Inference characteristics
Whisper-small (244 million parameters) requires approximately 2 GB VRAM and runs at roughly 4× the speed of the large model on an A100 GPU. The model is released under the Apache‑2.0 license.
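As a rough sanity check on the memory figure, the weights alone can be sized from the parameter count. This is back-of-the-envelope arithmetic under assumptions: the numeric precision behind the 2 GB figure is not stated, and activations, buffers, and framework overhead come on top of the weights.

```python
PARAMS = 244_000_000  # parameter count from the specs


def weight_megabytes(num_params, bytes_per_param):
    """Size of the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1_000_000


fp32_mb = weight_megabytes(PARAMS, 4)  # 976.0 MB at 4 bytes per parameter
fp16_mb = weight_megabytes(PARAMS, 2)  # 488.0 MB at 2 bytes per parameter
```

Under these assumptions, float32 weights take just under 1 GB, which leaves about half of the quoted 2 GB for everything else needed at inference time.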
Reference
Radford et al., Robust Speech Recognition via Large-Scale Weak Supervision (arXiv:2212.04356). Original code: github.com/openai/whisper.
best for
- Transcribing English and multilingual audio in chunks of up to 30 seconds
- Translating non-English speech into English (e.g., French to English)
- Long-form audio transcription using a chunked pipeline with batching
FAQ
What input does the model expect?
Input should be audio data (e.g., a WAV file) sampled at 16 kHz, which the model converts to log-Mel spectrograms before processing. Use the gigarouter OpenAI-compatible endpoint with an API key.
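A quick way to check whether a WAV file already matches the 16 kHz requirement is to read its header. This is a minimal sketch using Python's standard `wave` module; the helper names are hypothetical, and resampling itself is out of scope here.

```python
import wave

EXPECTED_RATE = 16_000  # Hz, the sampling rate the model expects


def wav_sample_rate(path):
    """Return the sampling rate stored in a WAV file's header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def needs_resampling(path):
    """True if the file must be resampled before it is sent to the model."""
    return wav_sample_rate(path) != EXPECTED_RATE
```

The `wave` module only reads uncompressed PCM WAV files, so other formats would need a different reader.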
How much memory does it need, and how fast is it?
Approximately 2 GB of VRAM. It runs about 4x faster than the large model on an A100 GPU.
Can it transcribe audio longer than 30 seconds?
Yes. With a chunking algorithm (`chunk_length_s=30`) and batching, the pipeline can handle audio of arbitrary length.
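The `chunk_length_s` option is a parameter of the Hugging Face `transformers` speech recognition pipeline. Below is a minimal sketch of chunked transcription with it, assuming `transformers` with a PyTorch backend and `ffmpeg` are installed. The function names and the batch size of 8 are illustrative choices, and the import is deferred so the module loads even where `transformers` is missing.

```python
def build_transcriber(device="cpu"):
    """Create a chunked ASR pipeline for openai/whisper-small."""
    # Imported lazily: the model is only downloaded when this is called.
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-small",
        chunk_length_s=30,  # window length in seconds
        device=device,
    )


def transcribe(path, batch_size=8, task="transcribe"):
    """Transcribe an audio file of any length.

    Pass task="translate" to get English text from non-English speech.
    batch_size=8 is an illustrative value, not a recommendation.
    """
    asr = build_transcriber()
    result = asr(path, batch_size=batch_size, generate_kwargs={"task": task})
    return result["text"]
```

A call such as `transcribe("meeting.wav")` (a hypothetical file name) returns the full transcript as one string. Larger batch sizes process more windows at once at the cost of memory.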
Which languages does it support?
It is a multilingual model trained on 96 languages. It can transcribe speech in its original language or translate it into English.
Does it need fine-tuning?
No. Whisper Small generalizes to many datasets without fine-tuning, though fine-tuning can further improve performance on specific domains.