Whisper Base
openai/whisper-base
published Sep 2022 · updated Feb 2024
Whisper Base is a speech recognition and translation model that transcribes and translates multilingual audio using a Transformer encoder-decoder architecture.
specs
| Spec | Value |
|---|---|
| Task | Automatic Speech Recognition (ASR) & Speech Translation |
| Architecture | Transformer Encoder-Decoder (Sequence-to-Sequence) |
| Parameters | 74 Million |
| License | Apache 2.0 |
about this model
Whisper Base is a multilingual automatic speech recognition (ASR) and speech translation model trained on 680,000 hours of weakly supervised audio data. It uses a Transformer encoder-decoder architecture and can transcribe speech in its original language or translate it into English. It also performs language identification and voice activity detection. Each of these tasks is selected through special context tokens supplied to the decoder.
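As a rough illustration of the token-based task control described above, the sketch below assembles the sequence of special tokens that precedes the transcript. The token spellings follow the format published with Whisper; the helper name `build_decoder_prompt` is hypothetical, and real tokenizers map these strings to integer IDs, which is not shown here.

```python
from typing import List

VALID_TASKS = ("transcribe", "translate")


def build_decoder_prompt(language: str, task: str = "transcribe",
                         timestamps: bool = False) -> List[str]:
    """Return the special tokens that tell the decoder what to do.

    Hypothetical helper for illustration only. `language` is a short
    language code such as "en" or "fr".
    """
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")

    tokens = ["<|startoftranscript|>"]
    tokens.append(f"<|{language}|>")   # spoken language of the audio
    tokens.append(f"<|{task}|>")       # same-language text vs. English text
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens


# French audio, translated into English text, without timestamps.
prompt = build_decoder_prompt("fr", task="translate")
```

For language identification, the language token is not supplied up front; the model predicts it as the first token after `<|startoftranscript|>`.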
With only 74 million parameters, the model demonstrates strong zero-shot generalization across diverse domains and benchmarks without requiring fine-tuning. It natively processes audio segments up to 30 seconds; longer recordings can be transcribed using chunking.
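A minimal sketch of the chunking idea, assuming 16 kHz mono samples held in a plain Python list: the recording is cut into fixed 30-second windows and the last window is zero-padded. The function name `split_into_chunks` is hypothetical, and the non-overlapping windows are a simplification; production pipelines usually overlap neighbouring chunks so that words on a boundary are not cut in half.

```python
from typing import List

SAMPLE_RATE = 16_000   # Whisper expects 16 kHz audio
CHUNK_SECONDS = 30     # native window length of the model


def split_into_chunks(samples: List[float],
                      sample_rate: int = SAMPLE_RATE,
                      chunk_seconds: int = CHUNK_SECONDS) -> List[List[float]]:
    """Split a waveform into fixed-length chunks, zero-padding the last one."""
    chunk_len = sample_rate * chunk_seconds
    chunks = []
    for start in range(0, len(samples), chunk_len):
        chunk = samples[start:start + chunk_len]
        # Pad a short final chunk with silence so every chunk has equal length.
        chunk = chunk + [0.0] * (chunk_len - len(chunk))
        chunks.append(chunk)
    return chunks


# 70 seconds of audio yields three 30-second chunks (the last one padded).
seventy_seconds = [0.1] * (70 * SAMPLE_RATE)
chunks = split_into_chunks(seventy_seconds)
```

Each chunk would then be transcribed independently and the resulting texts joined in order.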
Performance
| Dataset | Word Error Rate (WER) |
|---|---|
| LibriSpeech test-clean | 5.0088% |
| LibriSpeech test-other | 12.8494% |
| Common Voice 11.0 Hindi | 131% |
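Performance on low-resource languages like Hindi is poor, reflecting the model's training distribution. A WER above 100% is possible because inserted words count as errors, so the total number of errors can exceed the number of words in the reference transcript. For English-only tasks, the whisper-base.en variant typically achieves better accuracy.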
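For readers unfamiliar with the metric, here is a small sketch of how word error rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. This is a generic implementation, not the evaluation script behind the table above; published benchmarks typically also normalize casing and punctuation before scoring, which this sketch skips.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + substitution_cost,   # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Three inserted words against a two-word reference: 3 / 2 = 1.5 (150% WER).
wer = word_error_rate("hello world", "hello there big wide world")
```

The example shows why a score such as the Hindi figure can exceed 100%: a hypothesis with many spurious words accumulates more errors than the reference has words.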
Inference Characteristics
Whisper Base runs approximately 7× faster than the large-v2 model and requires roughly 1 GB of VRAM, making it well-suited for latency-sensitive or resource-constrained deployments. The model is released under the Apache 2.0 license.
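As a back-of-the-envelope check on the memory figure, the sketch below estimates the space taken by the weights alone from the 74 million parameter count. The function name is hypothetical. The rest of the quoted 1 GB presumably covers activations, audio features, and runtime overhead, although the source does not break this down.

```python
PARAMETERS = 74_000_000  # parameter count quoted for Whisper Base

BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_mb(num_params: int, dtype: str) -> float:
    """Memory needed for the weights alone, in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1_000_000


fp32_mb = weight_memory_mb(PARAMETERS, "float32")  # 296.0 MB
fp16_mb = weight_memory_mb(PARAMETERS, "float16")  # 148.0 MB
```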
best for
- Transcribing short audio clips (up to 30 seconds) in multiple languages
- Translating non-English speech into English text
FAQ
Whisper Base achieves 5.0088% WER on LibriSpeech test-clean and 12.8494% on test-other.
Whisper Base is approximately 7 times faster than the large-v2 model and requires about 1 GB of VRAM.
The API accepts audio files (e.g., WAV, MP3) and streams; refer to gigarouter documentation for exact format requirements.
Whisper Base can transcribe audio of arbitrary length by chunking it (e.g., into 30-second segments) via the pipeline.
The multilingual model supports speech recognition in many languages and translation of that speech into English. An English-only variant (whisper-base.en) is also available for English-only tasks.
We're benchmarking and onboarding Whisper Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
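Since the hosted endpoint is described as OpenAI-compatible, a request might look roughly like the sketch below, which builds (but does not send) a multipart upload to the `/audio/transcriptions` path used by OpenAI-style APIs. The base URL, the model identifier as accepted by the API, and the helper name are all assumptions; the actual values should come from the gigarouter documentation once the model is available.

```python
import urllib.request
import uuid

# Hypothetical placeholders: the real base URL and model ID are not given here.
BASE_URL = "https://api.example.com/v1"
MODEL_ID = "openai/whisper-base"


def build_transcription_request(audio_bytes: bytes, filename: str, api_key: str,
                                base_url: str = BASE_URL,
                                model: str = MODEL_ID) -> urllib.request.Request:
    """Build a multipart/form-data POST carrying the model name and audio file."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")

    return urllib.request.Request(
        url=f"{base_url}/audio/transcriptions",
        data=head + audio_bytes + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Build the request for an in-memory clip; nothing is sent over the network.
request = build_transcription_request(b"RIFF....", "clip.wav", api_key="YOUR_API_KEY")
```

Passing the returned object to `urllib.request.urlopen` would send it; that step is left out because the endpoint above is a placeholder.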