Whisper Small

openai/whisper-small

published Sep 2022 · updated Feb 2024

Whisper Small is an automatic speech recognition (ASR) and speech translation model that transcribes and translates multilingual audio using a transformer encoder-decoder architecture.

est. price: ~$0.0034 (estimated, set at launch)
API providers: 0
downloads / mo: 3.3M
license: apache-2.0

specs

Task: Automatic Speech Recognition (ASR) and Speech Translation
Architecture: Transformer encoder-decoder (sequence-to-sequence)
Parameters: 244 million
License: Apache-2.0

about this model

openai/whisper-small is an automatic speech recognition (ASR) model that also supports speech translation and language identification. It is a Transformer encoder-decoder (sequence-to-sequence) model trained on 680,000 hours of weakly supervised multilingual data, which lets it generalize zero-shot across domains and languages without fine-tuning.

Key capabilities

  • Multilingual speech recognition: transcribe audio in the same language as the input.
  • Speech translation: translate audio in another language into English text (e.g., French audio to English text).
  • Language identification and timestamp prediction (optional).
  • Long-form transcription via chunking (up to 30‑second windows, arbitrary total length).

Performance benchmarks

Dataset                        Word Error Rate (WER)
LibriSpeech test-clean         3.43%
LibriSpeech test-other         7.63%
Common Voice 11.0 (Hindi)      87.3%
Common Voice 13.0 (Divehi)     125.7%

Benchmark results reflect zero-shot evaluation; the model was not fine-tuned on any of these datasets.
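
A WER above 100%, as in the Divehi row, is possible because WER counts substitutions, deletions, and insertions against the number of reference words, so a hypothesis with many extra words can exceed the reference length in errors. The sketch below illustrates the standard word-level edit-distance definition; the function name `word_error_rate` and the sample strings are illustrative, and it skips the text normalization that real evaluations usually apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Three inserted words against a two-word reference give a WER of 150%.
print(f"{word_error_rate('hello world', 'hello big wide world now'):.0%}")
```

Multiply the returned fraction by 100 to compare it with the percentages in the table.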

Inference characteristics

Whisper-small (244 million parameters) requires approximately 2 GB VRAM and runs at roughly 4× the speed of the large model on an A100 GPU. The model is released under the Apache‑2.0 license.
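
As a rough sanity check on the ~2 GB figure, the weights alone can be estimated from the parameter count. The sketch below uses the standard sizes of 4 bytes per float32 value and 2 bytes per float16 value; attributing the remainder to activations, decoding state, and framework overhead is an assumption, not something the source states.

```python
PARAMS = 244_000_000  # parameter count from the specs above

# Standard storage size of one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Memory taken by the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(PARAMS, dtype):.2f} GB")
```

Under these assumptions the weights account for roughly 1 GB in float32 and about half that in float16.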

Reference

Radford et al., Robust Speech Recognition via Large-Scale Weak Supervision (arXiv:2212.04356). Original code: github.com/openai/whisper.

FAQ

What is the input format for the Whisper Small API?

Input is audio data (e.g., a WAV file) sampled at 16 kHz, which is converted to a log-Mel spectrogram before the model processes it. Once the model is live, requests go through the gigarouter OpenAI-compatible endpoint with an API key.
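
Before sending a file, it can help to confirm that it really is sampled at 16 kHz. The following is a minimal sketch using Python's standard `wave` module; the names `WavInfo`, `inspect_wav`, and `TARGET_SAMPLE_RATE` are hypothetical helpers for illustration and are not part of any gigarouter or Whisper API.

```python
import wave
from dataclasses import dataclass

TARGET_SAMPLE_RATE = 16_000  # Hz, the input rate stated above


@dataclass
class WavInfo:
    sample_rate: int
    channels: int
    duration_s: float

    @property
    def needs_resampling(self) -> bool:
        """True when the file is not already at the target sample rate."""
        return self.sample_rate != TARGET_SAMPLE_RATE


def inspect_wav(path: str) -> WavInfo:
    """Read the header of an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
        return WavInfo(rate, wav.getnchannels(), duration)
```

If `needs_resampling` is true, resample the audio with a tool of your choice before transcription.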

How much VRAM does Whisper Small require?

Approximately 2 GB VRAM. It runs about 4x faster than the large model on an A100 GPU.

Can Whisper Small transcribe audio longer than 30 seconds?

Yes. The pipeline uses a chunking algorithm that splits the audio into 30-second windows (chunk_length_s=30), optionally processed in batches, so it can handle audio of arbitrary length.
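
To illustrate the idea, the sketch below plans 30-second windows over a longer recording. It is a simplified illustration, not the pipeline's actual implementation: the function name `plan_chunks` and the `overlap_s` parameter (overlap between neighbouring windows) are assumptions introduced here.

```python
def plan_chunks(total_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0):
    """Return (start, end) times in seconds covering total_s of audio."""
    if chunk_s <= 0 or not 0 <= overlap_s < chunk_s:
        raise ValueError("need chunk_s > 0 and 0 <= overlap_s < chunk_s")

    total_s = float(total_s)
    step = chunk_s - overlap_s  # how far each new window advances
    chunks = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        chunks.append((start, end))
        if end >= total_s:
            break  # the last window reached the end of the audio
        start += step
    return chunks


# A 70-second recording yields three overlapping windows.
print(plan_chunks(70))
```

Each window would then be transcribed separately and the overlapping text merged; how that merge is done is outside this sketch.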

What languages does Whisper Small support?

It is a multilingual model trained on English plus 96 other languages. It can transcribe speech in the language spoken or translate it into English.

Is fine-tuning required for domain adaptation?

No, Whisper Small generalizes to many datasets without fine-tuning, though fine-tuning can further improve performance on specific domains.

not yet live

We're benchmarking and onboarding Whisper Small as a hosted, OpenAI-compatible API.