Whisper Small

openai/whisper-small

published Sep 2022 · updated Feb 2024

Whisper Small is an automatic speech recognition (ASR) and speech translation model that transcribes and translates multilingual audio using a transformer encoder-decoder architecture.

est. price	~$0.0034 (estimated, set at launch)
API providers
downloads / mo	3.3M
license	apache-2.0

specs

Task	Automatic Speech Recognition (ASR) and Speech Translation
Architecture	Transformer encoder-decoder (sequence-to-sequence)
Parameters	244 million
License	Apache-2.0

about this model

openai/whisper-small is an automatic speech recognition (ASR) model that also supports speech translation and language identification. It is a Transformer encoder-decoder (sequence-to-sequence) model trained on 680,000 hours of weakly supervised multilingual data, which lets it generalize zero-shot across domains and languages without fine-tuning.

Key capabilities

Multilingual speech recognition: transcribe audio in the same language as the input.
Speech translation: translate non-English speech into English text (e.g., French audio to English text).
Language identification and timestamp prediction (optional).
Long-form transcription via chunking (up to 30‑second windows, arbitrary total length).

Performance benchmarks

Dataset	Word Error Rate (WER)
LibriSpeech test-clean	3.43%
LibriSpeech test-other	7.63%
Common Voice 11.0 (Hindi)	87.3%
Common Voice 13.0 (Divehi)	125.7%

Benchmark results reflect zero-shot evaluation; the model was not fine-tuned on any of these datasets.
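
A WER above 100%, as in the Divehi row, is possible because inserted words count as errors while the denominator is only the number of reference words. The sketch below computes WER as word-level edit distance divided by reference length. It is an illustration only: the function name is made up for this example, and the text normalization used for the figures in the table is not known, so none is applied here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev holds the previous row of the edit-distance table:
    # prev[j] = distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Identical text: no errors.
    print(word_error_rate("the cat sat", "the cat sat"))
    # One substituted word out of three.
    print(word_error_rate("the cat sat", "the bat sat"))
    # Four inserted words against a one-word reference: WER of 4.0 (400%).
    print(word_error_rate("hello", "oh hello there my friend"))
```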

Inference characteristics

Whisper Small (244 million parameters) requires approximately 2 GB of VRAM and runs roughly 4× faster than the large model on an A100 GPU. The model is released under the Apache‑2.0 license.
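
As a rough sanity check on the memory figure, the weights alone can be sized from the parameter count. The sketch below is back-of-the-envelope arithmetic, not a measurement: the helper name is hypothetical, and the assumption that the rest of the ~2 GB goes to activations, decoding state, and framework overhead is ours, not stated by the source.

```python
PARAMETERS = 244_000_000  # parameter count quoted in the specs


def weight_memory_gib(num_parameters: int, bytes_per_parameter: int) -> float:
    """Memory needed to hold the weights only, in GiB."""
    return num_parameters * bytes_per_parameter / 2**30


if __name__ == "__main__":
    for dtype_name, width in (("float32", 4), ("float16", 2)):
        size = weight_memory_gib(PARAMETERS, width)
        print(f"{dtype_name}: {size:.2f} GiB")
    # float32: 0.91 GiB
    # float16: 0.45 GiB
```

Under these assumptions the weights account for under half of the quoted requirement at full precision.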

Reference

Radford et al., Robust Speech Recognition via Large-Scale Weak Supervision (arXiv:2212.04356). Original code: github.com/openai/whisper.

best for

·Transcribing English and multilingual audio up to 30 seconds per chunk
·Translating non-English speech into English (e.g., French to English)
·Long-form audio transcription using a chunked pipeline with batching

FAQ

What is the input format for the Whisper Small API?

Input should be audio data (e.g., a WAV file) sampled at 16 kHz; audio at other sample rates should be resampled first. The audio is converted to a log-Mel spectrogram before it reaches the model. Use the gigarouter OpenAI-compatible endpoint with an API key.
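
To illustrate the input requirements, here is a minimal sketch that reads a WAV file with the Python standard library, rejects anything that is not 16 kHz, and pads or trims the samples to one 30-second window. The function names are hypothetical, the sketch only handles 16-bit mono PCM, and it stops before the log-Mel step, which a real preprocessing library would perform.

```python
import array
import io
import sys
import wave

SAMPLE_RATE = 16_000                           # Hz, the expected input rate
WINDOW_SECONDS = 30                            # length of one model window
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS  # 480,000 samples


def load_pcm16_mono(wav_file) -> list[float]:
    """Read a 16-bit mono WAV (path or file object) as floats in [-1.0, 1.0)."""
    with wave.open(wav_file, "rb") as wav:
        if wav.getframerate() != SAMPLE_RATE:
            raise ValueError(
                f"expected {SAMPLE_RATE} Hz, got {wav.getframerate()} Hz; resample first"
            )
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit mono audio")
        pcm = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    return [sample / 32768.0 for sample in pcm]


def pad_or_trim(samples: list[float], length: int = WINDOW_SAMPLES) -> list[float]:
    """Zero-pad or truncate so the result is exactly `length` samples long."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


if __name__ == "__main__":
    # Build one second of silence in memory so the example needs no file on disk.
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(b"\x00\x00" * SAMPLE_RATE)
    buffer.seek(0)

    window = pad_or_trim(load_pcm16_mono(buffer))
    print(len(window))  # 480000
```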

How much VRAM does Whisper Small require?

Approximately 2 GB VRAM. It runs about 4x faster than the large model on an A100 GPU.

Can Whisper Small transcribe audio longer than 30 seconds?

Yes, by using a chunking algorithm with chunk_length_s=30 and batching, the pipeline can handle arbitrary-length audio.
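
The idea behind chunking can be sketched as computing overlapping 30-second windows over the sample stream and grouping them into batches. This is a simplified illustration, not the pipeline's actual algorithm: `overlap_s` and both function names are hypothetical, and the step of merging overlapping transcripts is left out.

```python
SAMPLE_RATE = 16_000  # Hz


def chunk_bounds(total_samples: int, chunk_length_s: float = 30.0,
                 overlap_s: float = 5.0, sample_rate: int = SAMPLE_RATE):
    """Return (start, end) sample indices of overlapping chunks covering the audio."""
    chunk = int(chunk_length_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk length")

    bounds = []
    start = 0
    while start < total_samples:
        end = min(start + chunk, total_samples)
        bounds.append((start, end))
        if end == total_samples:
            break  # the last chunk reached the end of the audio
        start += step
    return bounds


def batches(items, batch_size: int):
    """Group chunks into lists of at most `batch_size` for batched inference."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


if __name__ == "__main__":
    # 70 seconds of audio: windows at 0-30 s, 25-55 s, and 50-70 s.
    bounds = chunk_bounds(70 * SAMPLE_RATE)
    for start, end in bounds:
        print(start / SAMPLE_RATE, end / SAMPLE_RATE)
    print(batches(bounds, batch_size=2))
```

The overlap gives neighbouring chunks shared context, so a word that straddles a boundary appears whole in at least one window.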

What languages does Whisper Small support?

It is a multilingual model whose training data covers English and 96 other languages; it can transcribe speech in its original language or translate it into English.

Is fine-tuning required for domain adaptation?

No, Whisper Small generalizes to many datasets without fine-tuning, though fine-tuning can further improve performance on specific domains.

not yet live

We're benchmarking and onboarding Whisper Small as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
