Parakeet TDT 0.6B V2

mlx-community/parakeet-tdt-0.6b-v2

published May 2025 · updated May 2025

Parakeet TDT 0.6B V2 is an automatic speech recognition (ASR) model that transcribes audio into text with word-level timestamps and punctuation.

status: coming soon
API providers: 0
downloads / mo: 1.5M
license: cc-by-4.0

specs

Task: Automatic Speech Recognition (ASR)
Architecture: FastConformer with TDT decoder
Parameters: 600M
License: CC-BY-4.0

about this model

Parakeet-TDT-0.6B-v2 is an automatic speech recognition (ASR) model that transcribes long-form audio with high accuracy and word-level timestamps. It has 600M parameters and is built on a FastConformer architecture with a Token-and-Duration Transducer (TDT) decoder. It processes segments of up to 24 minutes in a single pass and achieves an inverse real-time factor (RTFx) of 3380 on the HF-Open-ASR leaderboard at batch size 128.

The model achieves strong word error rates (WERs) across diverse domains: LibriSpeech clean 1.69%, LibriSpeech other 3.19%, SPGI Speech 2.17%, tedlium-v3 3.38%, Vox Populi 5.95%, GigaSpeech test 9.74%, Earnings-22 11.15%, and AMI Meetings test 11.16%.

Key capabilities include automatic punctuation and capitalization, robust recognition of spoken numbers and song lyrics, and precise word-level timestamps. The model was trained on the nvidia/Granary and nvidia/nemo-asr-set-3.0 datasets and is released under the CC-BY-4.0 license.

Benchmark Summary

Dataset             WER (%)
LibriSpeech clean   1.69
LibriSpeech other   3.19
SPGI Speech         2.17
tedlium-v3          3.38
Vox Populi          5.95
GigaSpeech test     9.74
Earnings-22         11.15
AMI Meetings        11.16

Source: NVIDIA model card
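
For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the evaluation code behind the figures above; the function name `word_error_rate` is hypothetical, and published WERs are typically computed after text normalization, which this sketch omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the word-level edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A WER of 1.69% on LibriSpeech clean therefore corresponds to roughly 17 word errors per 1,000 reference words.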

FAQ

What is this model best for?

It excels at transcribing long audio segments (up to 24 minutes) with word-level timestamps, automatic punctuation, and high accuracy on numbers and song lyrics.

How does Parakeet TDT 0.6B V2 compare in size and speed?

With 600 million parameters, it is compact yet very fast, achieving an inverse real-time factor (RTFx) of 3380 at batch size 128 on the HF-Open-ASR leaderboard.
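
To make the RTFx figure concrete, the arithmetic below reads RTFx as seconds of audio transcribed per second of compute, which is an assumption about how the leaderboard defines it. The helper name `processing_seconds` is hypothetical. The quoted figure is batched throughput on the leaderboard's hardware, so single-file latency on other hardware will differ.

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Estimated compute time for a clip, given throughput expressed as RTFx."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


# A maximum-length 24-minute segment at the quoted RTFx of 3380.
segment_seconds = 24 * 60
print(round(processing_seconds(segment_seconds, 3380), 3))  # 0.426
```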

What are the license terms?

The model is released under the CC-BY-4.0 license, which permits use, redistribution, and adaptation, including commercially, provided attribution is given.

What input and output formats does the model support?

It accepts raw audio (e.g., WAV files) and outputs text with accurate word-level timestamps and automatic punctuation and capitalization.
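
Because a single pass is limited to 24 minutes, it can help to check a file's duration before submitting it. The following is a minimal sketch using Python's standard `wave` module; the function names and the `MAX_SEGMENT_SECONDS` constant are illustrative, and it only handles uncompressed PCM WAV files.

```python
import wave

MAX_SEGMENT_SECONDS = 24 * 60  # single-pass limit quoted on this page


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_single_pass(path: str) -> bool:
    """True if the file is short enough to transcribe in one pass."""
    return wav_duration_seconds(path) <= MAX_SEGMENT_SECONDS
```

Files longer than the limit would need to be split into shorter segments before transcription.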

How can I call this model via the gigarouter API?

The model is not yet live. Once it is, use the gigarouter OpenAI-compatible endpoint with your API key: pass the audio file as input and receive the transcribed text in the response.
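
As a rough sketch of what such a call might look like, the code below builds a multipart request in the shape of the OpenAI audio transcription endpoint (`POST /audio/transcriptions` with `model` and `file` fields) using only the standard library. Everything specific to gigarouter here is an assumption: `BASE_URL` is a placeholder rather than a real address, the model identifier is assumed to match the one shown on this page, and the response is assumed to be JSON with a `text` field.

```python
import json
import os
import urllib.request
import uuid

BASE_URL = "https://api.gigarouter.example/v1"   # placeholder, not a real endpoint
MODEL_ID = "mlx-community/parakeet-tdt-0.6b-v2"  # assumed to match the id on this page


def build_transcription_request(audio_bytes: bytes, filename: str,
                                api_key: str) -> urllib.request.Request:
    """Build an OpenAI-style multipart/form-data transcription request."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Form field carrying the model identifier.
        (f"--{boundary}\r\n"
         'Content-Disposition: form-data; name="model"\r\n\r\n'
         f"{MODEL_ID}\r\n").encode(),
        # Form field carrying the audio file itself.
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: audio/wav\r\n\r\n").encode(),
        audio_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{BASE_URL}/audio/transcriptions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def transcribe(path: str, api_key: str) -> str:
    """Send a WAV file and return the transcript (assumes a JSON 'text' field)."""
    with open(path, "rb") as audio_file:
        request = build_transcription_request(
            audio_file.read(), os.path.basename(path), api_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["text"]
```

Separating request construction from the network call keeps the request shape easy to inspect before the hosted endpoint exists.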

not yet live

We're benchmarking and onboarding Parakeet TDT 0.6B V2 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.