
Wav2Vec 2.0 Large VoxRex Swedish

KBLab/wav2vec2-large-voxrex-swedish

published Mar 2022 · updated May 2025

Wav2Vec 2.0 Large VoxRex Swedish is an automatic speech recognition (ASR) model fine-tuned on Swedish radio broadcasts, NST, and Common Voice data.

est. price: ~$0.0034 (estimated, set at launch)
API providers: 0
downloads / mo: 2.5M
license: cc0-1.0

specs

Task: Automatic Speech Recognition (ASR)
Architecture: Wav2Vec 2.0 Large
Input Sampling Rate: 16 kHz
Language: Swedish
License: CC0-1.0

about this model

KBLab/wav2vec2-large-voxrex-swedish is an automatic speech recognition (ASR) model for Swedish, fine-tuned from the VoxRex large pretrained model on Swedish radio broadcasts, NST, and Common Voice data.

Key strengths

Without a language model, the model achieves a word error rate (WER) of 2.5% on the combined NST and Common Voice test set. On the Common Voice test set alone, the WER is 8.49% with direct decoding and 7.37% with a 4-gram language model. The base VoxRex model was pretrained for 400,000 updates on the P4-10k corpus, which contains 10,000 hours of Swedish local public service radio and 1,500 hours of audiobooks and other speech from the collections of KB (the National Library of Sweden).
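For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below illustrates that standard definition only. The page does not say which text normalization or scoring tool produced the reported figures, and the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only; they are not taken from NST or Common Voice.
reference = "det var en gång en katt"
hypothesis = "det var en gång katt"
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")  # one deleted word out of six
```

A WER of 2.5% therefore corresponds to roughly one word error per 40 reference words.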

Performance illustration

[Figure: Chart comparing word error rates across Swedish ASR models. The chart shows performance without the additional 20,000 steps of Common Voice fine-tuning.]

[Figure: Word error rate during training over 120,000 updates.]

Usage notes

Speech input must be sampled at 16 kHz. The model is distributed under the CC0-1.0 license. Further details are given in the paper "Hearing voices at the National Library – a speech corpus and acoustic model for the Swedish language" (Fonetik 2022).
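Because the 16 kHz requirement is easy to violate silently, it can help to check a file's header before transcription. The following is a minimal sketch using only Python's standard `wave` module. It assumes uncompressed WAV input, and the helper names (`wav_sample_rate`, `needs_resampling`, `make_silent_wav`) are hypothetical, introduced here for illustration.

```python
import io
import wave


def wav_sample_rate(wav_bytes: bytes) -> int:
    """Return the sample rate stored in the header of a WAV file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(wav_bytes: bytes, target_rate: int = 16_000) -> bool:
    """True when the audio is not already at the rate the model expects."""
    return wav_sample_rate(wav_bytes) != target_rate


def make_silent_wav(sample_rate: int, seconds: float = 0.1) -> bytes:
    """Build a short mono 16-bit silent WAV in memory, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    return buffer.getvalue()


print(needs_resampling(make_silent_wav(48_000)))  # 48 kHz input must be resampled
print(needs_resampling(make_silent_wav(16_000)))  # 16 kHz input can be used as is
```

The check only reads the header; the resampling itself would be done with an audio library of your choice.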


FAQ

What is the word error rate (WER) of this model?

On the NST + Common Voice test set the WER is 2.5%. On the Common Voice test set alone it is 8.49% without a language model and 7.37% with a 4-gram language model.

How do I use this model via the gigarouter API?

The hosted API is not yet live. Once it is, use the gigarouter OpenAI-compatible endpoint with your API key: send audio sampled at 16 kHz as input and receive the transcription text in the response.
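Since the endpoint is described only as OpenAI-compatible, the sketch below shows what a request might look like if it mirrors OpenAI's transcription route (`POST /audio/transcriptions` with multipart `model` and `file` fields). Everything specific here is an assumption: the base URL is a placeholder, the model identifier is assumed to match the repository name, and the function `build_transcription_request` is hypothetical. The code only builds the request and does not send it.

```python
import urllib.request
import uuid

# ASSUMPTIONS: placeholder base URL (not a real endpoint) and a route and
# field layout borrowed from OpenAI's transcription API.
BASE_URL = "https://api.gigarouter.example/v1"
MODEL_ID = "KBLab/wav2vec2-large-voxrex-swedish"


def build_transcription_request(
    audio: bytes, api_key: str, filename: str = "audio.wav"
) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data transcription request."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{MODEL_ID}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: audio/wav\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/transcriptions",
        data=head + audio + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Dummy bytes stand in for a real 16 kHz WAV file.
request = build_transcription_request(b"<16 kHz WAV bytes>", "YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

Once a real endpoint is published, the request could be sent with `urllib.request.urlopen(request)`. If the response also follows OpenAI's shape, the transcript would be in the JSON `text` field, but that too is an assumption until the service launches.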

What license is the model released under?

The model is released under the CC0-1.0 license, allowing free use for any purpose.

Does the model require a language model to work?

No. The model can be used directly, without a language model. Optionally, a 4-gram language model improves accuracy: on the Common Voice test set the WER falls from 8.49% to 7.37%.
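To put the optional language model in perspective, the arithmetic below works out the gain from the two Common Voice figures quoted on this page. It is a worked calculation, not a new measurement.

```python
wer_direct = 8.49   # % WER on Common Voice without a language model (from this page)
wer_with_lm = 7.37  # % WER on Common Voice with a 4-gram language model (from this page)

absolute_gain = wer_direct - wer_with_lm   # percentage points
relative_gain = absolute_gain / wer_direct  # fraction of the original errors removed

print(f"absolute: {absolute_gain:.2f} points")  # absolute: 1.12 points
print(f"relative: {relative_gain:.1%}")         # relative: 13.2%
```

In other words, the 4-gram model removes about one word error in eight on that test set.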

What audio format and sampling rate does the model expect?

The model expects audio sampled at 16 kHz. Audio recorded at another rate, such as 48 kHz, must be resampled to 16 kHz first, for example with torchaudio.
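As a rough illustration of running the model locally, the sketch below loads the checkpoint with the Hugging Face `transformers` library, resamples with `torchaudio` when needed, and decodes greedily without a language model. It assumes `torch`, `torchaudio`, and `transformers` are installed and that the weights can be downloaded. The function name `transcribe` is hypothetical, and this is not necessarily the authors' own example.

```python
MODEL_ID = "KBLab/wav2vec2-large-voxrex-swedish"
TARGET_RATE = 16_000  # the model expects 16 kHz input


def transcribe(path: str) -> str:
    """Transcribe one audio file with greedy CTC decoding (no language model)."""
    # Imported lazily so that defining this function does not require the
    # heavy dependencies; they are needed only when it is called.
    import torch
    import torchaudio
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
    model.eval()

    waveform, sample_rate = torchaudio.load(path)  # shape: (channels, samples)
    waveform = waveform.mean(dim=0)                # downmix to mono
    if sample_rate != TARGET_RATE:
        waveform = torchaudio.functional.resample(waveform, sample_rate, TARGET_RATE)

    inputs = processor(
        waveform.numpy(), sampling_rate=TARGET_RATE, return_tensors="pt", padding=True
    )
    with torch.no_grad():
        logits = model(**inputs).logits

    # Greedy decoding: pick the most likely token per frame, then let the
    # processor collapse repeats and blanks into text.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Calling `transcribe("clip.wav")` with a path to your own recording returns the transcript as a string. The first call downloads the model weights, so it needs network access and some disk space.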

not yet live

We're benchmarking and onboarding Wav2Vec 2.0 Large VoxRex Swedish as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
