Whisper Tiny
openai/whisper-tiny
published Sep 2022 · updated Feb 2024
Whisper Tiny is an automatic speech recognition (ASR) model that transcribes and translates speech across multiple languages using a Transformer encoder-decoder architecture trained on 680k hours of weakly supervised data.
specs
| Spec | Value |
|---|---|
| Task | Automatic Speech Recognition (ASR) & Speech Translation |
| Architecture | Transformer encoder-decoder (sequence-to-sequence) |
| Parameters | 39 M |
| License | MIT |
about this model
openai/whisper-tiny is an automatic speech recognition (ASR) model that transcribes audio to text and can also perform speech translation, trained on 680,000 hours of weakly supervised multilingual data.
Architecture and training
Whisper uses a Transformer encoder-decoder (sequence-to-sequence) architecture. The model was trained on 680k hours of labelled speech: roughly 65% is English audio with English transcripts (438k hours), about 18% is non-English audio with English transcripts (126k hours), and about 17% is non-English audio with native-language transcripts (117k hours). The non-English data covers 98 languages. The model was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision (Radford et al., OpenAI).
Key strengths and benchmarks
Whisper-tiny generalizes to many domains without fine-tuning. On LibriSpeech test-clean, it achieves a Word Error Rate (WER) of 7.55%. The tiny variant (39 million parameters) requires approximately 1 GB VRAM and runs about 10x faster than the large model. It supports both transcription (same language as audio) and translation (to English). Long audio can be transcribed by chunking into 30-second segments, with optional timestamp prediction.
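To make the WER figure above concrete, here is a minimal sketch of how a word error rate can be computed: the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The function name `word_error_rate` is illustrative, and the only normalization applied is lowercasing and whitespace splitting. Published benchmark numbers usually rely on a fuller text normalizer, so this sketch is not expected to reproduce the reported score exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    # Assumption: lowercase + whitespace split is the only normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sit on mat" gives one substitution and one deletion over six reference words, a WER of about 33%.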
Model sizes
| Size | Parameters | English-only | Multilingual |
|---|---|---|---|
| tiny | 39 M | ✓ | ✓ |
| base | 74 M | ✓ | ✓ |
| small | 244 M | ✓ | ✓ |
| medium | 769 M | ✓ | ✓ |
| large | 1550 M | ✗ | ✓ |
| large-v2 | 1550 M | ✗ | ✓ |
Known limitations
Due to weak supervision on noisy data, the model may produce hallucinations (text not present in the audio). Accuracy varies by language, with lower performance on low-resource languages that have less training data.
best for
- Transcribing short audio clips (up to 30 seconds) in multiple languages
- Speech translation from non-English audio to English text
- Language identification from spoken audio
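Because the model works on windows of up to 30 seconds, longer recordings have to be split before transcription. The sketch below computes sample ranges for consecutive 30-second windows. It assumes 16 kHz mono audio and non-overlapping windows. The helper name `chunk_bounds` is hypothetical, and production pipelines often overlap neighbouring windows to avoid cutting words at the boundaries.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000   # samples per second (assumed 16 kHz mono input)
CHUNK_SECONDS = 30     # window length the model handles in one pass


def chunk_bounds(num_samples: int,
                 sample_rate: int = SAMPLE_RATE,
                 chunk_seconds: int = CHUNK_SECONDS) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices for consecutive windows.

    The last window is shorter when the audio length is not a
    multiple of the window size.
    """
    if num_samples < 0 or sample_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("lengths and rates must be positive")
    step = sample_rate * chunk_seconds
    return [(start, min(start + step, num_samples))
            for start in range(0, num_samples, step)]
```

A 75-second recording at 16 kHz (1,200,000 samples) yields three windows: two full 30-second chunks and a final 15-second chunk. Each slice `samples[start:end]` can then be transcribed separately and the texts concatenated.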
FAQ
The API accepts audio as a file upload or a base64-encoded PCM 16-bit mono 16 kHz waveform. The model internally converts audio to log-Mel spectrograms.
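As an illustration of the base64 PCM option, the following sketch converts floating-point samples to 16-bit mono PCM and base64-encodes the result using only the Python standard library. Little-endian byte order is an assumption (the text above does not state one), and the function name `float_to_pcm16_base64` is illustrative rather than part of any API.

```python
import base64
import math
import struct

SAMPLE_RATE = 16_000  # Hz, mono


def float_to_pcm16_base64(samples) -> str:
    """Encode floats in [-1.0, 1.0] as base64 16-bit PCM.

    Assumption: little-endian signed 16-bit integers.
    """
    # Clip first so out-of-range values cannot overflow the 16-bit range.
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    raw = struct.pack("<%dh" % len(ints), *ints)
    return base64.b64encode(raw).decode("ascii")


# One second of a 440 Hz test tone at half amplitude.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)]
payload = float_to_pcm16_base64(tone)
```

One second of 16 kHz audio becomes 32,000 bytes of raw PCM before base64 encoding, which is a quick way to sanity-check a payload.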
Whisper Tiny is the smallest and fastest Whisper model: it runs roughly 10x faster than large and requires about 1 GB of VRAM.
It supports 98 languages for speech recognition and can translate from many of those languages into English. Performance varies by language, especially for low-resource ones.
Use the OpenAI-compatible endpoint with your gigarouter API key, sending a POST request to the /v1/audio/transcriptions or /v1/audio/translations path with the audio file.
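A request of this kind might be assembled as in the sketch below, which builds (but does not send) a multipart/form-data POST with the standard library. Several things here are assumptions: `BASE_URL` is a placeholder rather than the real gigarouter host, the model identifier `openai/whisper-tiny` is taken from this page and may differ in the API, and the `file` and `model` form fields follow OpenAI's audio API on the assumption that the endpoint mirrors it.

```python
import urllib.request
import uuid

BASE_URL = "https://api.example.com"   # placeholder, not the real host
API_KEY = "YOUR_GIGAROUTER_API_KEY"    # placeholder credential


def build_audio_request(audio_bytes: bytes, filename: str,
                        path: str = "/v1/audio/transcriptions",
                        model: str = "openai/whisper-tiny",
                        base_url: str = BASE_URL,
                        api_key: str = API_KEY) -> urllib.request.Request:
    """Build a multipart/form-data POST carrying the audio file."""
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    # Parts are joined with CRLF, as multipart/form-data requires.
    body = b"\r\n".join([
        delimiter,
        b'Content-Disposition: form-data; name="model"',
        b"",
        model.encode("utf-8"),
        delimiter,
        b'Content-Disposition: form-data; name="file"; filename="'
        + filename.encode("utf-8") + b'"',
        b"Content-Type: application/octet-stream",
        b"",
        audio_bytes,
        delimiter + b"--",   # closing delimiter
        b"",
    ])
    request = urllib.request.Request(base_url + path, data=body, method="POST")
    request.add_header("Authorization", "Bearer " + api_key)
    request.add_header("Content-Type",
                       "multipart/form-data; boundary=" + boundary)
    return request
```

Passing `path="/v1/audio/translations"` targets the translation endpoint instead. Once the placeholders are replaced, `urllib.request.urlopen(request)` sends the request. If the service follows OpenAI's response format, the body is JSON with the transcript under a `text` key.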
The MIT license allows free use, modification, and distribution. The model can be deployed locally using the openai-whisper Python package and a compatible GPU, but gigarouter provides a hosted API.
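For local deployment, a minimal sketch with the openai-whisper package could look like the following. It assumes the package is installed (`pip install openai-whisper`) and that ffmpeg is available on the system for decoding audio files. The wrapper name `transcribe_local` is illustrative, and the import is deferred so the function can be defined even where the package is missing.

```python
def transcribe_local(audio_path: str,
                     model_name: str = "tiny",
                     task: str = "transcribe") -> str:
    """Run Whisper locally and return the recognized text.

    task="transcribe" keeps the spoken language;
    task="translate" produces English text.
    """
    # Imported lazily: provided by the openai-whisper package.
    import whisper

    model = whisper.load_model(model_name)   # downloads weights on first use
    result = model.transcribe(audio_path, task=task)
    return result["text"]
```

A call such as `transcribe_local("meeting.mp3")` would return the transcript as a single string. The file name is only an example.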
We're benchmarking and onboarding Whisper Tiny as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.