Canary 1B Flash

nvidia/canary-1b-flash

published Mar 2025 · updated Jun 2026

Canary 1B Flash is a multilingual, multitask speech model. It performs automatic speech recognition (ASR) in English, German, French, and Spanish, and speech-to-text translation from English into German, French, or Spanish and from those languages into English. Output can optionally include punctuation and capitalization, and word-level or segment-level timestamps.

est. price

~$0.0034

· estimated, set at launch

API providers

downloads / mo

3.9K

license

cc-by-4.0

specs

Task	ASR and AST (Automatic Speech Translation)
Architecture	FastConformer encoder with 32 layers + Transformer decoder with 4 layers
Parameters	883 million
License	CC-BY-4.0

about this model

nvidia/canary-1b-flash is an automatic speech recognition (ASR) model that transcribes and translates speech across four languages. With 883 million parameters, it supports ASR in English, German, French, and Spanish, and speech-to-text translation from English into German, French, or Spanish, and from those languages into English. Output can include punctuation and capitalization (PnC) and optional word-level or segment-level timestamps (experimental). The model achieves an inference speed of more than 1,000 RTFx on the OpenASR leaderboard benchmark when run on an NVIDIA A100 GPU.

Architecture and Training

Canary-1b-flash uses an encoder-decoder architecture with a FastConformer encoder and a Transformer decoder. Task tokens (e.g., target language, task type, and whether to emit timestamps) are fed into the decoder to control generation. The model was trained on 85,000 hours of speech data, comprising 31,000 hours of public datasets, 20,000 hours from Suno, and 34,000 hours of in-house data. The public data includes English, German, French, and Spanish speech from sources such as LibriSpeech, Multilingual LibriSpeech, Common Voice, VoxPopuli, and others.
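The task controls described above can be sketched as a small request object that picks the task from the language pair and rejects pairs the model does not cover. This is an illustration only: the names `TaskRequest` and `control_fields`, and the field names in the returned dictionary, are hypothetical and not taken from the model's actual interface.

```python
from dataclasses import dataclass

# Languages listed on this page for ASR and translation.
SUPPORTED_LANGS = {"en", "de", "fr", "es"}


@dataclass(frozen=True)
class TaskRequest:
    """Hypothetical container for the controls the decoder is conditioned on."""
    source_lang: str
    target_lang: str
    pnc: bool = True          # punctuation and capitalization
    timestamps: bool = False  # word- or segment-level timestamps (experimental)

    @property
    def task(self) -> str:
        # Same source and target language means transcription (ASR);
        # different languages means speech-to-text translation.
        return "asr" if self.source_lang == self.target_lang else "translation"


def control_fields(req: TaskRequest) -> dict:
    """Validate a request and return the control settings as plain strings."""
    for lang in (req.source_lang, req.target_lang):
        if lang not in SUPPORTED_LANGS:
            raise ValueError(f"unsupported language: {lang!r}")
    # Translation is only described between English and de/fr/es.
    if req.task == "translation" and "en" not in (req.source_lang, req.target_lang):
        raise ValueError("translation requires English as source or target")
    return {
        "task": req.task,
        "source_lang": req.source_lang,
        "target_lang": req.target_lang,
        "pnc": "yes" if req.pnc else "no",
        "timestamps": "yes" if req.timestamps else "no",
    }
```

For example, `control_fields(TaskRequest("de", "en"))` describes German-to-English translation with punctuation, while `TaskRequest("de", "fr")` is rejected because neither side is English.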

Benchmark Performance

Word error rate (WER) on the Hugging Face OpenASR leaderboard (without PnC, using greedy decoding, text normalized with whisper-normalizer):

Dataset	WER (%)
AMI	13.11
GigaSpeech	9.85
LibriSpeech Clean	1.48
LibriSpeech Other	2.87
Earnings22	12.79
SPGISpeech	1.95
Tedlium	3.12
VoxPopuli	5.63
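For readers unfamiliar with the metric, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a minimal illustration, not the leaderboard's evaluation code: it lowercases and splits on whitespace as a simplified stand-in for whisper-normalizer, and the function name `wer` is chosen here for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Per-dataset WER (%) copied from the table above.
TABLE_WER = {
    "AMI": 13.11, "GigaSpeech": 9.85, "LibriSpeech Clean": 1.48,
    "LibriSpeech Other": 2.87, "Earnings22": 12.79, "SPGISpeech": 1.95,
    "Tedlium": 3.12, "VoxPopuli": 5.63,
}
mean_wer = sum(TABLE_WER.values()) / len(TABLE_WER)
```

The unweighted mean of the eight table rows comes to about 6.35%.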

Inference speed on an NVIDIA A100 (batch size 128) is 1,045.75 RTFx.
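RTFx is the inverse real-time factor: seconds of audio processed per second of compute. A short worked calculation shows what the figure above implies. Because it was measured at batch size 128, it describes batched throughput rather than single-file latency. The helper name `processing_seconds` is illustrative.

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Compute time implied by an RTFx figure (audio duration / processing time)."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


RTFX_A100 = 1045.75  # figure reported above for an NVIDIA A100, batch size 128

# One hour of audio at this throughput takes roughly 3.44 seconds of compute.
one_hour = processing_seconds(3600, RTFX_A100)
```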

Licensing

This model is released under the CC-BY-4.0 license and is available for commercial use.

best for

·Transcribing English, German, French, and Spanish audio with punctuation
·Translating speech from English to German, French, or Spanish and vice versa
·Generating word-level and segment-level timestamps for audio in supported languages

FAQ

What languages does Canary 1B Flash support?

It supports ASR in English, German, French, and Spanish, and speech-to-text translation between English and each of the other three languages.

How large is the model and how fast is it?

It has 883M parameters and achieves over 1,000 RTFx on the OpenASR leaderboard benchmark (1,045.75 RTFx on an NVIDIA A100 at batch size 128).

Can it generate timestamps?

Yes, it can produce word-level and segment-level timestamps for audio in English, German, French, and Spanish (experimental).

What is the license and can I use it commercially?

It is released under CC-BY-4.0, which allows commercial use.

How do I call this model via the gigarouter API?

Once the model is live, use the gigarouter OpenAI-compatible endpoint with your API key and specify the model name "canary-1b-flash".
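As a rough sketch of what such a call might look like, the snippet below builds (but does not send) a multipart request in the shape of the OpenAI audio transcription API. Everything endpoint-specific is an assumption: the base URL is a placeholder, the `/audio/transcriptions` path is assumed to be served because the endpoint is described as OpenAI-compatible, and the model is not yet live, so the request cannot be tested against the real service.

```python
import urllib.request
import uuid

# Placeholder only: the real gigarouter base URL is not given on this page.
BASE_URL = "https://api.gigarouter.example/v1"
MODEL = "canary-1b-flash"


def build_transcription_request(
    audio_bytes: bytes,
    filename: str,
    api_key: str,
    base_url: str = BASE_URL,
    model: str = MODEL,
) -> urllib.request.Request:
    """Build a POST request with a multipart/form-data body (model + audio file)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/audio/transcriptions",  # assumed OpenAI-style path
        data=head + audio_bytes + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Sending the request would be a call to `urllib.request.urlopen(req)` once a real base URL and API key are available.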

not yet live

We're benchmarking and onboarding Canary 1B Flash as a hosted, OpenAI-compatible API.

related speech-to-text models

speaker-diarization-3.1

wav2vec2-large-xlsr-53-japanese

6.1M dl/mo

wav2vec2-large-xlsr-53-polish

4.7M dl/mo

wav2vec2-large-xlsr-53-dutch

4.1M dl/mo