Hosted speech-to-text models
53 models · 0 live as APIs · benchmarked & compared
Speech-to-text models convert spoken audio into written text, enabling applications such as real-time captioning, meeting transcription, voice-controlled interfaces, and automated subtitling. Speaker diarization models, such as pyannote/speaker-diarization-3.1, complement transcription by identifying who spoke when, which is critical for multi-speaker recordings like conference calls or interviews.
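To illustrate how "who spoke when" can be attached to a transcript, here is a minimal sketch that labels each transcribed segment with the speaker whose turn overlaps it most. The `Segment` and `Turn` types and the `assign_speakers` function are hypothetical names for illustration; they are not the output format of pyannote or of any model listed on this page.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """A transcribed span of audio (times in seconds)."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """A diarization turn: one speaker talking over a time span."""
    start: float
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: List[Segment], turns: List[Turn]) -> List[Tuple[str, str]]:
    """Pair each segment's text with the speaker whose turn overlaps it most."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "unknown", 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append((best_speaker, seg.text))
    return labeled


# Made-up example data, not real model output.
segments = [
    Segment(0.0, 2.5, "Hello, can you hear me?"),
    Segment(2.6, 4.0, "Yes, loud and clear."),
]
turns = [Turn(0.0, 2.4, "SPEAKER_00"), Turn(2.4, 4.2, "SPEAKER_01")]

for speaker, text in assign_speakers(segments, turns):
    print(f"{speaker}: {text}")
```

Maximum overlap is only one possible assignment rule; a segment that spans two turns may be better split at the turn boundary if word-level timestamps are available.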
In production, these models are typically deployed in pipelines that include voice activity detection, language identification, and post-processing for punctuation and formatting. Choosing a model means trading off transcription accuracy, latency, and computational cost. For example, openai/whisper-base (72.6M parameters) is a fast, compact option, while larger variants such as openai/whisper-large-v2 (1543.3M parameters) aim for higher accuracy at the expense of speed and memory, and specialized models like jonatasgrosman/wav2vec2-large-xlsr-53-japanese are tuned for a specific language.
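The pipeline stages described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: voice activity detection is a naive energy threshold rather than a trained model, language identification is omitted, and `fake_transcribe` is a hypothetical stand-in for a real model call. None of these names come from a specific library.

```python
from typing import Callable, List, Tuple


def detect_speech(samples: List[float], frame_len: int, threshold: float) -> List[Tuple[int, int]]:
    """Naive VAD: return (start, end) sample ranges whose mean |amplitude| exceeds threshold."""
    regions = []
    start = None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(abs(s) for s in frame) / len(frame)
        if energy > threshold and start is None:
            start = i                      # speech begins
        elif energy <= threshold and start is not None:
            regions.append((start, i))     # speech ends
            start = None
    if start is not None:                  # audio ended mid-speech
        regions.append((start, len(samples)))
    return regions


def postprocess(text: str) -> str:
    """Minimal formatting: trim, capitalize, and ensure terminal punctuation."""
    text = text.strip()
    if not text:
        return text
    text = text[0].upper() + text[1:]
    if text[-1] not in ".?!":
        text += "."
    return text


def run_pipeline(
    samples: List[float],
    transcribe: Callable[[List[float]], str],
    frame_len: int = 4,
    threshold: float = 0.1,
) -> List[str]:
    """VAD -> transcribe each speech region -> post-process the text."""
    return [
        postprocess(transcribe(samples[s:e]))
        for s, e in detect_speech(samples, frame_len, threshold)
    ]


def fake_transcribe(chunk: List[float]) -> str:
    """Hypothetical placeholder for a speech-to-text model call."""
    return f"speech chunk of {len(chunk)} samples"


# Toy signal: silence, a burst of "speech", silence.
audio = [0.0] * 4 + [0.5, -0.4, 0.6, -0.5] * 2 + [0.0] * 4
print(run_pipeline(audio, fake_transcribe))
```

Running VAD first means the transcriber only sees regions likely to contain speech, which tends to reduce both compute and spurious output on silence.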
This page lists 53 speech-to-text models, none of which is live yet (all are being onboarded), including pyannote/speaker-diarization-3.1, argmaxinc/whisperkit-coreml, openai/whisper-base, and several wav2vec2 variants.
compare
| model | params | downloads/mo | price | status |
|---|---|---|---|---|
| pyannote/speaker-diarization-3.1 | - | 8.2M | at launch | coming soon |
| argmaxinc/whisperkit-coreml | - | 8M | at launch | coming soon |
| openai/whisper-base | 72.6M | 6.4M | ~$0.0034 / minute | coming soon |
| jonatasgrosman/wav2vec2-large-xlsr-53-japanese | - | 6.1M | at launch | coming soon |
| jonatasgrosman/wav2vec2-large-xlsr-53-polish | - | 4.7M | at launch | coming soon |
| jonatasgrosman/wav2vec2-large-xlsr-53-dutch | - | 4.1M | at launch | coming soon |
| indonesian-nlp/wav2vec2-indonesian-javanese-sundanese | - | 4.1M | at launch | coming soon |
| pyannote/speaker-diarization-community-1 | - | 4M | at launch | coming soon |
| jonatasgrosman/wav2vec2-large-xlsr-53-arabic | - | 3.5M | at launch | coming soon |
| jonatasgrosman/wav2vec2-large-xlsr-53-hungarian | - | 3.4M | at launch | coming soon |
| openai/whisper-small | 241.7M | 3.3M | ~$0.0034 / minute | coming soon |
| MahmoudAshraf/mms-300m-1130-forced-aligner | 315.5M | 3.2M | ~$0.0034 / minute | coming soon |
| jonatasgrosman/wav2vec2-large-xlsr-53-portuguese | - | 3.2M | at launch | coming soon |
| jonatasgrosman/wav2vec2-large-xlsr-53-russian | - | 2.9M | at launch | coming soon |
| gigant/romanian-wav2vec2 | 315.5M | 2.8M | ~$0.0034 / minute | coming soon |
| anuragshas/wav2vec2-large-xlsr-53-telugu | - | 2.8M | at launch | coming soon |
| jonatasgrosman/wav2vec2-large-xlsr-53-persian | - | 2.5M | at launch | coming soon |
| KBLab/wav2vec2-large-voxrex-swedish | 315.5M | 2.5M | ~$0.0034 / minute | coming soon |
| kingabzpro/wav2vec2-large-xls-r-300m-Urdu | 315.5M | 2.3M | ~$0.0034 / minute | coming soon |
| theainerd/Wav2Vec2-large-xlsr-hindi | 315.5M | 2.1M | ~$0.0034 / minute | coming soon |
| pyannote/voice-activity-detection | - | 2M | at launch | coming soon |
| mistralai/Voxtral-Mini-4B-Realtime-2602 | 4429.7M | 2M | ~$0.0034 / minute | coming soon |
| imvladikon/wav2vec2-xls-r-300m-hebrew | 315.5M | 1.8M | ~$0.0034 / minute | coming soon |
| mesolitica/wav2vec2-xls-r-300m-mixed | - | 1.8M | at launch | coming soon |
| airesearch/wav2vec2-large-xlsr-53-th | - | 1.7M | at launch | coming soon |
| openai/whisper-tiny | 37.8M | 1.6M | ~$0.0034 / minute | coming soon |
| jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn | - | 1.5M | at launch | coming soon |
| mlx-community/parakeet-tdt-0.6b-v2 | - | 1.5M | at launch | coming soon |
| arijitx/wav2vec2-xls-r-300m-bengali | - | 1.4M | at launch | coming soon |
| Systran/faster-whisper-base | - | 1.4M | at launch | coming soon |
| Qwen/Qwen3-ASR-1.7B | 2349.2M | 1.4M | ~$0.0034 / minute | coming soon |
| Qwen/Qwen3-ASR-0.6B | 938M | 941.1K | ~$0.0034 / minute | coming soon |
| nvidia/parakeet-ctc-1.1b | 1062.6M | 781.7K | ~$0.0034 / minute | coming soon |
| microsoft/Phi-4-multimodal-instruct | 5574.5M | 541.1K | ~$0.0034 / minute | coming soon |
| zai-org/GLM-ASR-Nano-2512 | 2257.8M | 133.7K | ~$0.0034 / minute | coming soon |
| openai/whisper-large-v2 | 1543.3M | 115K | ~$0.0034 / minute | coming soon |
| openai/whisper-medium.en | 763.9M | 50.1K | ~$0.0034 / minute | coming soon |
| openai/whisper-small.en | 241.7M | 45.8K | ~$0.0034 / minute | coming soon |
| UsefulSensors/moonshine-base | 61.5M | 40.6K | ~$0.0034 / minute | coming soon |
| nvidia/parakeet-rnnt-0.6b | 616.7M | 36.6K | ~$0.0034 / minute | coming soon |
| openai/whisper-large | 1543.3M | 35K | ~$0.0034 / minute | coming soon |
| openai/whisper-base.en | 72.6M | 30.8K | ~$0.0034 / minute | coming soon |
| nvidia/parakeet-ctc-0.6b | 608.8M | 15.3K | ~$0.0034 / minute | coming soon |
| UsefulSensors/moonshine-streaming-medium | 265.9M | 12.9K | ~$0.0034 / minute | coming soon |
| UsefulSensors/moonshine-streaming-small | 140.1M | 6.1K | ~$0.0034 / minute | coming soon |
| nvidia/canary-1b-flash | 811M | 3.9K | ~$0.0034 / minute | coming soon |
| distil-whisper/distil-large-v3.5 | 756.4M | 3K | ~$0.0034 / minute | coming soon |
| nvidia/parakeet-rnnt-1.1b | 1070.5M | 2.4K | ~$0.0034 / minute | coming soon |
| AutoArk-AI/ARK-ASR-3B | 4063.4M | 1.7K | ~$0.0034 / minute | coming soon |
| AutoArk-AI/ARK-ASR-0.6B | 1299.5M | 1.6K | ~$0.0034 / minute | coming soon |
| OpenMOSS-Team/MOSS-Transcribe-preview-2B | 2418.8M | 879 | ~$0.0034 / minute | coming soon |
| shunyalabs/pingala-v1-universal | 808.9M | 73 | ~$0.0034 / minute | coming soon |
| kyutai/stt-2.6b-en | 2617.1M | - | ~$0.0034 / minute | coming soon |
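
As a rough guide to what the listed price means in practice, the sketch below estimates transcription cost from audio duration. It assumes billing is strictly proportional to audio length at the approximate rate of $0.0034 per minute shown in the table, with no minimum charge or rounding; actual billing rules are not stated on this page, and the function name `estimate_cost` is illustrative.

```python
# Approximate per-minute price from the table above; an assumption that may change at launch.
PRICE_PER_MINUTE_USD = 0.0034


def estimate_cost(audio_seconds: float, price_per_minute: float = PRICE_PER_MINUTE_USD) -> float:
    """Estimated cost in USD, assuming cost scales linearly with audio duration."""
    return audio_seconds / 60.0 * price_per_minute


for label, seconds in [("1-hour meeting", 3600), ("100 hours of calls", 100 * 3600)]:
    print(f"{label}: ~${estimate_cost(seconds):.2f}")
```

At this rate, an hour of audio comes to roughly $0.20 and 100 hours to roughly $20.40. Models marked "at launch" have no listed price, so the estimate does not apply to them.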