Wav2Vec2 XLSR-53 Large Portuguese

jonatasgrosman/wav2vec2-large-xlsr-53-portuguese

published Mar 2022 · updated Dec 2022

Wav2Vec2 XLSR-53 Large Portuguese is an automatic speech recognition model, fine-tuned on Common Voice 6.1, that transcribes Portuguese audio into text.

status

coming soon

API providers

downloads / mo

3.2M

license

apache-2.0

specs

Task	Automatic Speech Recognition (ASR)
Architecture	Wav2Vec2-XLSR-53 Large
Language	Portuguese
Dataset	Common Voice 6.1
License	Apache 2.0

about this model

jonatasgrosman/wav2vec2-large-xlsr-53-portuguese is an automatic speech recognition (ASR) model fine-tuned from Facebook’s Wav2Vec2-XLSR-53 large checkpoint on Portuguese speech. It transcribes Portuguese audio into text and is optimized for input sampled at 16 kHz.
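
For readers who want to run the checkpoint locally, the following is a minimal sketch of greedy (no-LM) transcription. It assumes the `transformers`, `torch`, and `librosa` packages are installed and that the checkpoint can be downloaded from the Hugging Face Hub; the function name `transcribe` and the file paths passed to it are illustrative, not part of the model's documentation.

```python
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-portuguese"
TARGET_SAMPLE_RATE = 16_000  # the model expects 16 kHz input


def transcribe(paths):
    """Transcribe audio files to Portuguese text with greedy CTC decoding (no LM)."""
    # Heavy dependencies are imported lazily; they are assumed to be installed.
    import librosa
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
    model.eval()

    # librosa resamples each file to the requested rate while loading it.
    waveforms = [librosa.load(path, sr=TARGET_SAMPLE_RATE)[0] for path in paths]
    inputs = processor(
        waveforms,
        sampling_rate=TARGET_SAMPLE_RATE,
        return_tensors="pt",
        padding=True,
    )

    with torch.no_grad():
        logits = model(
            inputs.input_values, attention_mask=inputs.attention_mask
        ).logits

    # Greedy decoding: take the most likely token at every frame, then let the
    # processor collapse repeats and blanks into text.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)
```

Calling `transcribe(["recording.wav"])` (a hypothetical file name) would return a list with one transcript per input file. The first call downloads the checkpoint, so it can take a while.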

The model was fine-tuned on the train and validation splits of Mozilla Common Voice 6.1. It was evaluated on the Common Voice Portuguese test set and on the Robust Speech Event Dev Data, both with and without a language model (LM).

Benchmark Results

Dataset	Metric	Without LM	With LM
Common Voice pt (test)	WER	11.31%	9.01%
Common Voice pt (test)	CER	3.74%	3.21%
Robust Speech Event Dev Data	WER	42.1%	36.92%
Robust Speech Event Dev Data	CER	17.93%	16.88%

Without a language model, the model reaches a word error rate (WER) of 11.31% on the Common Voice Portuguese test set; adding an LM lowers it to 9.01%. The corresponding character error rates (CER) are 3.74% and 3.21%. On the more challenging Robust Speech Event Dev Data, WER is 42.1% without an LM and 36.92% with one, and CER is 17.93% and 16.88%.
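
To make the metrics concrete, here is a small sketch of how WER and CER are commonly computed: the edit distance between reference and hypothesis, divided by the reference length, counted in words for WER and in characters for CER. This is the textbook definition, not necessarily the exact evaluation script behind the table; details such as text normalization and whether spaces count toward CER are assumptions. The example sentence is made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: char-level edit distance over reference length.

    Spaces are counted as characters here, which is an assumption.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(before, after):
    """Fraction by which an error rate drops, relative to its starting value."""
    return (before - after) / before


# One wrong word out of five gives a WER of 0.2.
print(wer("o gato está no telhado", "o gato esta no telhado"))

# Relative WER reduction from adding the LM on Common Voice pt (about 20.3%).
print(f"{relative_reduction(11.31, 9.01):.1%}")
```

Applied to the table, the LM cuts WER by roughly a fifth in relative terms on Common Voice pt, and by about 12% on the Robust Speech Event Dev Data.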

This is a specialist model for Portuguese ASR. A hosted API on gigarouter is planned but not yet live. The model is released under the Apache 2.0 license.

best for

· Transcribing Portuguese audio recordings such as calls and meetings
· Adding Portuguese speech-to-text to voice assistants, captioning, and transcription services

FAQ

What input sample rate does the model require?

The model expects speech input sampled at 16 kHz.
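
A quick way to check a file before transcription is to read its sample rate from the header. The sketch below uses only Python's standard `wave` module, so it handles uncompressed PCM WAV files only; the helper names are illustrative. Files at another rate would need resampling with an audio library first.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # rate the model expects


def wav_sample_rate(path):
    """Return the sample rate (in Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target=TARGET_SAMPLE_RATE):
    """True if the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target
```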

Can I use the model without a language model?

Yes, it can be used directly without a language model, but adding one improves accuracy (WER drops from 11.31% to 9.01% on Common Voice pt).

What is the license of this model?

Apache 2.0.

How do I call this model via the gigarouter API?

The hosted API is not yet live. Once it launches, use the OpenAI-compatible endpoint with your API key; refer to the gigarouter documentation for the exact endpoint and request format.
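
As a rough illustration of what an OpenAI-compatible transcription request usually looks like, the sketch below builds (but does not send) a multipart POST using only the standard library. Everything specific in it is an assumption: the base URL is a placeholder rather than a real gigarouter address, the `/audio/transcriptions` path and the `model` and `file` form fields follow the OpenAI convention, and the model identifier is assumed to match the repository name. Check the gigarouter documentation before relying on any of it.

```python
import urllib.request
import uuid

# Placeholder values: NOT real gigarouter settings.
BASE_URL = "https://api.example.invalid/v1"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-portuguese"  # assumed identifier


def build_transcription_request(audio_bytes, filename, api_key,
                                base_url=BASE_URL, model=MODEL_ID):
    """Build a multipart/form-data POST in the OpenAI transcription style."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Form field carrying the model identifier.
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="model"\r\n\r\n',
        model.encode() + b"\r\n",
        # Form field carrying the audio file itself.
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        audio_bytes + b"\r\n",
        # Closing boundary.
        f"--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url=f"{base_url}/audio/transcriptions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Once the real endpoint is known, the returned request could be sent with `urllib.request.urlopen`. In practice an OpenAI-compatible client library would likely be more convenient than hand-built multipart bodies.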

What architecture is the model based on?

It is based on Wav2Vec2-XLSR-53 Large, a self-supervised speech representation model pre-trained on 53 languages.

not yet live

We're benchmarking and onboarding Wav2Vec2 XLSR-53 Large Portuguese as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related speech-to-text models

speaker-diarization-3.1

wav2vec2-large-xlsr-53-japanese

6.1M dl/mo

wav2vec2-large-xlsr-53-polish

4.7M dl/mo

wav2vec2-large-xlsr-53-dutch

4.1M dl/mo