
Canary 1B Flash

nvidia/canary-1b-flash

published Mar 2025 · updated Jun 2026

Canary 1B Flash is a multilingual multitasking speech model that performs automatic speech recognition (ASR) in English, German, French, and Spanish, and speech-to-text translation between English and German, French, or Spanish, with optional punctuation/capitalization and word-level or segment-level timestamps.

est. price: ~$0.0034 (estimated, set at launch)
API providers: 0
downloads / mo: 3.9K
license: cc-by-4.0

specs

Task: ASR and AST (automatic speech translation)
Architecture: FastConformer encoder with 32 layers + Transformer decoder with 4 layers
Parameters: 883 million
License: CC-BY-4.0

about this model

nvidia/canary-1b-flash is an automatic speech recognition (ASR) model that transcribes and translates speech across four languages. With 883 million parameters, it supports ASR in English, German, French, and Spanish, and speech-to-text translation from English into German, French, or Spanish, and from those languages into English. Output can include punctuation and capitalization (PnC) and optional word-level or segment-level timestamps (experimental). The model achieves an inference speed of more than 1,000 RTFx on the OpenASR leaderboard benchmark when run on an NVIDIA A100 GPU.

Architecture and Training

Canary 1B Flash uses an encoder-decoder architecture with a FastConformer encoder and a Transformer decoder. Task tokens (e.g., target language, task type, and a timestamp toggle) are fed into the decoder to control generation. The model was trained on 85,000 hours of speech data: 31,000 hours of public datasets, 20,000 hours from Suno, and 34,000 hours of in-house data. The public data includes English, German, French, and Spanish speech from sources such as LibriSpeech, Multilingual LibriSpeech, Common Voice, VoxPopuli, and others.
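The task-token idea above can be illustrated with a small sketch. It only encodes the language rules stated on this page (ASR in four languages, translation with English on one side) and returns a plain dictionary. The function names and dictionary keys are hypothetical, chosen for illustration; they are not the model's actual token format or any library's API.

```python
# Languages this page lists as supported for ASR.
SUPPORTED_LANGS = {"en", "de", "fr", "es"}


def resolve_task(source_lang: str, target_lang: str) -> str:
    """Return "asr" or "ast" for a language pair, or raise ValueError.

    Hypothetical helper: it mirrors the rules described in the text,
    not the model's internal prompt handling.
    """
    if source_lang not in SUPPORTED_LANGS or target_lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language pair: {source_lang} -> {target_lang}")
    if source_lang == target_lang:
        return "asr"  # same language in and out: transcription
    if "en" in (source_lang, target_lang):
        return "ast"  # translation needs English on one side
    raise ValueError("translation is only described to or from English")


def build_prompt_fields(source_lang, target_lang, pnc=True, timestamps=False):
    """Collect the decoder controls named in the text into one dict.

    The key names are illustrative assumptions.
    """
    return {
        "task": resolve_task(source_lang, target_lang),
        "source_lang": source_lang,
        "target_lang": target_lang,
        "pnc": pnc,                # punctuation and capitalization
        "timestamps": timestamps,  # word- or segment-level (experimental)
    }


if __name__ == "__main__":
    print(build_prompt_fields("de", "en", timestamps=True))
```

Validating the pair before any audio is sent makes an unsupported request such as German to French fail early with a clear error.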

Benchmark Performance

Word error rate (WER) on the Hugging Face OpenASR leaderboard (without PnC, using greedy decoding, text normalized with whisper-normalizer):

Dataset              WER (%)
AMI                  13.11
GigaSpeech           9.85
LibriSpeech Clean    1.48
LibriSpeech Other    2.87
Earnings22           12.79
SPGISpeech           1.95
Tedlium              3.12
VoxPopuli            5.63
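For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the model output (substitutions, deletions, and insertions), divided by the number of reference words. The sketch below is a minimal illustration, assuming simple lowercasing and whitespace splitting as a stand-in for the whisper-normalizer used by the leaderboard; it is not the leaderboard's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance.

    Normalization here is only lowercasing and whitespace splitting,
    an assumption made to keep the example short.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(round(word_error_rate("the cat sat on the mat", "the cat sat on mat"), 2))
```

Because the denominator is the reference length, WER can exceed 100% when the output contains many inserted words.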

Inference speed on an NVIDIA A100 (batch size 128) is 1,045.75 RTFx.
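RTFx is the inverse real-time factor: seconds of audio processed per second of compute. The short sketch below turns the figure above into wall-clock time. It assumes the batched throughput carries over to the workload in question, which may not hold for single-stream or small-batch use.

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Estimate compute time for a given amount of audio at a given RTFx."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


if __name__ == "__main__":
    # One hour of audio at the reported 1,045.75 RTFx.
    print(f"{processing_seconds(3600, 1045.75):.2f} s")  # about 3.44 s
```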

Licensing

This model is released under the CC-BY-4.0 license and is available for commercial use.

FAQ

What languages does Canary 1B Flash support?

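It supports ASR in English, German, French, and Spanish, and speech-to-text translation from English into German, French, or Spanish and from those three languages into English.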

How large is the model and how fast is it?

It has 883 million parameters and exceeds 1,000 RTFx on the OpenASR leaderboard benchmark when run on an NVIDIA A100 GPU (1,045.75 RTFx at batch size 128).

Can it generate timestamps?

Yes. Word-level and segment-level timestamps are available as an experimental feature for audio in English, German, French, and Spanish.

What is the license and can I use it commercially?

It is released under CC-BY-4.0, which allows commercial use.

How do I call this model via the gigarouter API?

Once the model is live, use the gigarouter OpenAI-compatible endpoint with your API key and specify the model name "canary-1b-flash".
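As a rough sketch of what such a call might look like, the example below builds (but does not send) a multipart request shaped like OpenAI's audio transcription route. Several parts are assumptions: the base URL is a placeholder because this page does not give the real one, and whether gigarouter exposes an `/audio/transcriptions` path with `file` and `model` fields is not confirmed here. The function name is hypothetical.

```python
import urllib.request
import uuid

# Placeholder only: the real gigarouter base URL is not stated on this page.
BASE_URL = "https://api.example.invalid/v1"


def build_transcription_request(audio_bytes, filename, api_key,
                                model="canary-1b-flash"):
    """Build a POST request in the shape of OpenAI's transcription route.

    Assumption: the hosted endpoint mirrors that route and its
    multipart fields ("file", "model"). Nothing is sent here.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Text field carrying the model name.
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="model"\r\n\r\n'
         f"{model}\r\n").encode(),
        # File field header, followed by the raw audio bytes.
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         f"Content-Type: application/octet-stream\r\n\r\n").encode(),
        audio_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{BASE_URL}/audio/transcriptions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


if __name__ == "__main__":
    req = build_transcription_request(b"RIFF....", "sample.wav", "YOUR_API_KEY")
    print(req.get_method(), req.full_url)
```

Sending would be a matter of passing the request to `urllib.request.urlopen` once the model is available and the real base URL is known.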

not yet live

We're benchmarking and onboarding Canary 1B Flash as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
