Parakeet TDT 0.6B V2
mlx-community/parakeet-tdt-0.6b-v2
published May 2025 · updated May 2025
Parakeet TDT 0.6B V2 is an automatic speech recognition (ASR) model that transcribes audio into text with word-level timestamps and punctuation.
specs
| Spec | Value |
|---|---|
| Task | Automatic Speech Recognition (ASR) |
| Architecture | FastConformer with TDT decoder |
| Parameters | 600M |
| License | CC-BY-4.0 |
about this model
Parakeet-TDT-0.6B-v2 is an automatic speech recognition (ASR) model that transcribes long-form audio with high accuracy and word-level timestamps. Built on a FastConformer encoder with a Token-and-Duration Transducer (TDT) decoder and 600M parameters, it processes segments of up to 24 minutes in a single pass and reaches an inverse real-time factor (RTFx) of 3380 on the HF-Open-ASR leaderboard (batch size 128).
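To make the throughput figure concrete: RTFx is the ratio of audio duration to processing time, so a higher value means faster transcription. The sketch below works through the arithmetic for a single 24-minute segment. It assumes the leaderboard figure carries over unchanged, which is optimistic because that figure was measured at batch size 128. The function name is illustrative only.

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Estimate processing time for a clip at a given RTFx.

    RTFx = audio duration / processing time,
    so processing time = audio duration / RTFx.
    """
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


segment = 24 * 60  # one 24-minute segment, in seconds
print(f"{processing_seconds(segment, 3380):.3f} s")  # prints "0.426 s"
```

Real single-file latency depends on hardware and batch size, so treat the result as a lower bound rather than a prediction.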
The model achieves strong word error rates (WERs) across diverse domains: LibriSpeech clean 1.69%, LibriSpeech other 3.19%, SPGI Speech 2.17%, tedlium-v3 3.38%, Vox Populi 5.95%, GigaSpeech test 9.74%, Earnings-22 11.15%, and AMI Meetings test 11.16%.
Key capabilities include automatic punctuation and capitalization, robust recognition of spoken numbers and song lyrics, and precise word-level timestamps. The model was trained on the nvidia/Granary and nvidia/nemo-asr-set-3.0 datasets and is released under the CC-BY-4.0 license.
Benchmark Summary
| Dataset | WER (%) |
|---|---|
| LibriSpeech clean | 1.69 |
| LibriSpeech other | 3.19 |
| SPGI Speech | 2.17 |
| tedlium-v3 | 3.38 |
| Vox Populi | 5.95 |
| GigaSpeech test | 9.74 |
| Earnings-22 | 11.15 |
| AMI Meetings | 11.16 |
Source: NVIDIA model card
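For readers unfamiliar with the metric, WER is conventionally the number of word substitutions, deletions, and insertions divided by the number of words in the reference transcript. The sketch below computes it with a word-level edit distance. It is not the leaderboard's evaluation code: it assumes only lowercasing and whitespace splitting, whereas published scores typically apply fuller text normalization first.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # matching i reference words against nothing costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(f"{wer('the cat sat on the mat', 'the cat sat on mat'):.2%}")  # prints "16.67%"
```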
best for
- Transcribing long meetings or lectures (up to 24 minutes in a single pass)
- Real-time captioning with high throughput (RTFx 3380)
- Transcribing song lyrics or spoken numbers with high accuracy
FAQ
It excels at transcribing long audio segments (up to 24 minutes) with word-level timestamps, automatic punctuation, and high accuracy on numbers and song lyrics.
With 600 million parameters, the model is compact and fast, reaching an inverse real-time factor (RTFx) of 3380 at batch size 128 on the HF-Open-ASR leaderboard.
The model is released under the CC-BY-4.0 license, allowing use with attribution.
It accepts raw audio (e.g., WAV files) and outputs text with accurate word-level timestamps and automatic punctuation and capitalization.
Use the gigarouter OpenAI-compatible endpoint with your API key. Pass the audio file as input and receive the transcribed text in the response.
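A request along those lines might look like the sketch below, which uses the official `openai` Python package against an OpenAI-compatible transcription route. The base URL, model identifier, and environment variable name are placeholders, not confirmed values. The sketch also assumes the endpoint implements the standard audio transcriptions route, so check the provider's documentation before relying on it.

```python
import os

# Placeholders: substitute the real values from your provider's documentation.
BASE_URL = "https://api.example.com/v1"   # hypothetical endpoint URL
MODEL_ID = "parakeet-tdt-0.6b-v2"         # hypothetical model identifier
API_KEY_ENV = "GIGAROUTER_API_KEY"        # hypothetical environment variable


def transcribe(path: str) -> str:
    """Upload an audio file to an OpenAI-compatible endpoint and return the text."""
    if not os.path.isfile(path):
        raise FileNotFoundError(path)

    # Imported lazily so the module loads without the dependency;
    # install it with `pip install openai`.
    from openai import OpenAI

    client = OpenAI(base_url=BASE_URL, api_key=os.environ[API_KEY_ENV])
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(model=MODEL_ID, file=audio)
    return result.text
```

Usage would be `print(transcribe("meeting.wav"))`. Whether word-level timestamps are returned through this route depends on the endpoint, so the sketch reads only the plain text.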
We're benchmarking and onboarding Parakeet TDT 0.6B V2 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.