Whisper Base

openai/whisper-base

published Sep 2022 · updated Feb 2024

Whisper Base is a speech recognition and translation model that transcribes and translates multilingual audio using a Transformer encoder-decoder architecture.

est. price

~$0.0034

· estimated, set at launch

downloads / mo

6.4M

license

apache-2.0

specs

Task	Automatic Speech Recognition (ASR) & Speech Translation
Architecture	Transformer Encoder-Decoder (Sequence-to-Sequence)
Parameters	74 Million
License	Apache 2.0

about this model

Whisper Base is a multilingual automatic speech recognition (ASR) and speech translation model trained on 680,000 hours of weakly supervised audio data. It uses a Transformer encoder-decoder architecture and can transcribe speech in its original language or translate it into English. It also performs language identification and voice activity detection. All of these tasks are selected via special context tokens passed to the decoder.
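As a rough illustration of how context tokens select the task, the sketch below assembles a decoder prompt as a list of token strings. The token spellings follow the published Whisper convention; the helper name `build_decoder_prompt` and the small language table are hypothetical and exist only for this example. In practice a tokenizer or processor would build this prompt for you.

```python
# Illustrative subset of language tokens; the real model covers many more languages.
LANGUAGE_TOKENS = {"english": "<|en|>", "french": "<|fr|>", "hindi": "<|hi|>"}
TASK_TOKENS = {"transcribe": "<|transcribe|>", "translate": "<|translate|>"}


def build_decoder_prompt(language, task="transcribe", timestamps=False):
    """Return the context tokens that steer the decoder (hypothetical helper)."""
    if language not in LANGUAGE_TOKENS:
        raise ValueError(f"language not in the illustrative table: {language}")
    if task not in TASK_TOKENS:
        raise ValueError(f"unknown task: {task}")
    tokens = ["<|startoftranscript|>", LANGUAGE_TOKENS[language], TASK_TOKENS[task]]
    if not timestamps:
        # Ask the model to emit plain text without timestamp tokens.
        tokens.append("<|notimestamps|>")
    return tokens


# French audio, translated into English text, no timestamps.
print("".join(build_decoder_prompt("french", task="translate")))
```

Switching `task` between `"transcribe"` and `"translate"` is the only change needed to move between same-language transcription and translation to English in this sketch.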

With only 74 million parameters, the model demonstrates strong zero-shot generalization across diverse domains and benchmarks without requiring fine-tuning. It natively processes audio segments up to 30 seconds; longer recordings can be transcribed using chunking.
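The 30-second limit can be pictured as a fixed-size input window. The sketch below pads or trims a list of audio samples to exactly 30 seconds. It assumes 16 kHz mono audio, which is Whisper's usual input rate but is not stated on this page; the function name `pad_or_trim` is used here for illustration only.

```python
SAMPLE_RATE = 16_000  # assumption: 16 kHz mono input
WINDOW_SECONDS = 30   # the model's native segment length
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS  # 480,000 samples


def pad_or_trim(samples, length=WINDOW_SAMPLES):
    """Return exactly `length` samples: trim extra audio, zero-pad short audio."""
    samples = list(samples)
    if len(samples) >= length:
        return samples[:length]
    # Pad the tail with silence so every input has the same size.
    return samples + [0.0] * (length - len(samples))


short_clip = [0.1] * (5 * SAMPLE_RATE)  # 5 seconds of audio
window = pad_or_trim(short_clip)
print(len(window))
```

Audio longer than the window is simply cut off here, which is why longer recordings need chunking rather than a single pass.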

Performance

Dataset	Word Error Rate (WER)
LibriSpeech test-clean	5.0088%
LibriSpeech test-other	12.8494%
Common Voice 11.0 Hindi	131%

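Performance on low-resource languages such as Hindi is poor, reflecting the model's training distribution. A WER above 100% means the output contains more word errors (substitutions, deletions, and insertions) than the reference transcript has words. For English-only tasks, the whisper-base.en variant typically achieves better accuracy.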
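For readers unfamiliar with the metric, here is a minimal sketch of how WER is typically computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. It does no text normalization (casing, punctuation), which published benchmarks usually apply, so it will not reproduce the figures above exactly.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Three inserted words against a one-word reference give a WER of 300%.
print(word_error_rate("hello", "hello there my friend"))
```

Because insertions count as errors but do not increase the denominator, a model that outputs extra or repeated words can score above 100%.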

Inference Characteristics

Whisper Base runs approximately 7× faster than the large-v2 model and requires roughly 1 GB of VRAM, making it well-suited for latency-sensitive or resource-constrained deployments. The model is released under the Apache 2.0 license.
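A quick back-of-the-envelope check on the memory figure: the weights alone take far less than 1 GB. The sketch below only multiplies the parameter count by the bytes per parameter; attributing the remainder of the quoted footprint to activations, buffers, and framework overhead is an assumption, not something measured here.

```python
PARAMETERS = 74_000_000  # parameter count from the specs above
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_mb(n_params, dtype):
    """Memory needed for the weights alone, in megabytes (10^6 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1_000_000


for dtype in BYTES_PER_PARAM:
    print(dtype, weight_memory_mb(PARAMETERS, dtype), "MB")
```

Under these assumptions, half-precision weights need about half the memory of full-precision weights, which is one common way to fit the model on smaller devices.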

best for

· Transcribing short audio clips (up to 30 seconds) in multiple languages
· Translating non-English speech into English text

FAQ

What is the Word Error Rate (WER) of whisper-base on LibriSpeech test-clean?

Whisper Base achieves 5.0088% WER on LibriSpeech test-clean and 12.8494% on test-other.

How does whisper-base compare in speed to the large model?

Whisper Base is approximately 7 times faster than the large model and requires about 1 GB of VRAM.

What input formats does the gigarouter API accept?

The API accepts audio files (e.g., WAV, MP3) and streams; refer to gigarouter documentation for exact format requirements.

Can whisper-base handle long audio recordings?

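Yes. The model itself processes at most 30 seconds at a time, but a chunking pipeline (for example, the Hugging Face Transformers ASR pipeline with chunk_length_s=30) splits longer audio into 30-second segments and joins the results, so recordings of arbitrary length can be transcribed.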
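The splitting half of chunking can be sketched as follows. The function cuts a sample list into 30-second windows with a small overlap so that words on a boundary appear in two neighbouring chunks. The 5-second overlap and the function name `chunk_audio` are assumptions for illustration; merging the per-chunk transcripts across overlaps is not shown.

```python
def chunk_audio(samples, sample_rate=16_000, chunk_seconds=30.0, overlap_seconds=5.0):
    """Split samples into (start_index, chunk) pairs with overlapping windows."""
    chunk = int(chunk_seconds * sample_rate)
    step = chunk - int(overlap_seconds * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk length")

    chunks = []
    start = 0
    while start < len(samples):
        chunks.append((start, samples[start:start + chunk]))
        if start + chunk >= len(samples):
            break  # this window already reaches the end of the audio
        start += step
    return chunks


# 70 seconds of audio at a toy rate of 10 samples per second.
audio = list(range(700))
pieces = chunk_audio(audio, sample_rate=10)
print([(start, len(piece)) for start, piece in pieces])
```

Each chunk would then be transcribed independently, which also makes it straightforward to batch chunks for throughput.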

Is whisper-base multilingual?

Yes. whisper-base is the multilingual checkpoint and supports speech recognition and translation to English for many languages. A separate English-only variant (whisper-base.en) is also available.
