Wav2Vec2 Large XLSR-53 Telugu

anuragshas/wav2vec2-large-xlsr-53-telugu

published Mar 2022 · updated Jul 2021

Wav2Vec2 Large XLSR-53 Telugu is an automatic speech recognition model fine-tuned on the OpenSLR SLR66 Telugu dataset for transcribing Telugu speech.

status: coming soon
API providers: 0
downloads / mo: 2.8M
license: apache-2.0

specs

Task: Automatic Speech Recognition (ASR)
Architecture: Wav2Vec2 Large XLSR-53
Language: Telugu
Training Dataset: OpenSLR SLR66 (70% split)

about this model

anuragshas/wav2vec2-large-xlsr-53-telugu is an automatic speech recognition (ASR) model fine-tuned from facebook/wav2vec2-large-xlsr-53 for transcribing Telugu speech. The model was trained on 70% of the OpenSLR SLR66 Telugu dataset (CC BY-SA 4.0 licensed, containing male and female speaker recordings) and requires input audio sampled at 16 kHz.
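For local experimentation, the description above can be turned into a minimal sketch. It assumes the checkpoint loads with the standard Wav2Vec2Processor and Wav2Vec2ForCTC classes from the transformers library and that torch and torchaudio are installed; none of this is confirmed by this page. The transcribe helper and its greedy (argmax) decoding are illustrative choices, not the author's published usage code.

```python
MODEL_ID = "anuragshas/wav2vec2-large-xlsr-53-telugu"
TARGET_SAMPLE_RATE = 16_000  # the model expects 16 kHz input


def transcribe(path: str) -> str:
    """Transcribe one audio file with greedy decoding and no language model."""
    # Imported lazily so the heavy dependencies load only when this is called.
    import torch
    import torchaudio
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
    model.eval()

    waveform, sample_rate = torchaudio.load(path)  # shape: (channels, samples)
    waveform = waveform.mean(dim=0)                # downmix to mono
    if sample_rate != TARGET_SAMPLE_RATE:
        waveform = torchaudio.functional.resample(
            waveform, sample_rate, TARGET_SAMPLE_RATE
        )

    inputs = processor(
        waveform.numpy(), sampling_rate=TARGET_SAMPLE_RATE, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Calling transcribe("sample.wav") downloads the checkpoint on first use; the file name here is a placeholder.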

Performance

On the OpenSLR SLR66 test set, the model achieves a Word Error Rate (WER) of 44.98% when used without a language model. The evaluation applies a text normalizer that removes non-Telugu characters and punctuation, and lowercases the reference text.
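WER is the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below shows one way such a metric and normalizer could be computed. It assumes that "Telugu characters" means the Unicode Telugu block (U+0C00 to U+0C7F); the exact normalizer used for the reported 44.98% is not given on this page, so the normalize and word_error_rate helpers are illustrative only.

```python
import re

# Assumption: keep only the Unicode Telugu block (U+0C00-U+0C7F) and whitespace.
_NON_TELUGU = re.compile(r"[^\u0C00-\u0C7F\s]")


def normalize(text: str) -> str:
    """Lowercase, drop characters outside the Telugu block, collapse whitespace."""
    text = _NON_TELUGU.sub("", text.lower())
    return " ".join(text.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference has no words after normalization")
    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

With a three-word reference and one wrong word in the hypothesis, word_error_rate returns one third; a corpus-level figure would instead sum the edits and reference words over all utterances before dividing.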

Architecture and Training

The model is fine-tuned from facebook/wav2vec2-large-xlsr-53 (Apache-2.0 licensed), a cross-lingual pretrained model that demonstrated a 72% relative phoneme error rate reduction on CommonVoice and 16% relative WER improvement on BABEL in the original XLSR-53 paper. Training used 70% of the OpenSLR SLR66 Telugu dataset.

Evaluation

On the held-out test split of the OpenSLR SLR66 Telugu dataset, the model achieves a Word Error Rate (WER) of 44.98% when used without a language model. The evaluation normalizes text by removing punctuation and non-Telugu characters and by lowercasing.

Dataset and License

The OpenSLR SLR66 Telugu dataset is licensed under CC BY-SA 4.0. The base model facebook/wav2vec2-large-xlsr-53 is released under Apache-2.0 and was introduced in the original XLSR-53 paper (arXiv:2006.13979), which provides context for the pretraining approach.

FAQ

What audio input format does the model require?

Input speech must be sampled at 16 kHz. The model expects raw audio arrays (e.g., waveforms loaded with torchaudio) rather than encoded files.
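A quick way to catch sample-rate mismatches before inference is to check the file header. The sketch below uses only the Python standard library and, as a simplifying assumption, handles only 16-bit mono PCM WAV files; load_wav_16k is a hypothetical helper name. It rejects files that are not at 16 kHz instead of resampling them, since resampling is better left to a dedicated audio library.

```python
import array
import sys
import wave

TARGET_SAMPLE_RATE = 16_000  # the rate this page says the model requires


def load_wav_16k(path: str) -> list[float]:
    """Read a 16-bit mono PCM WAV file and return samples scaled to [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        if rate != TARGET_SAMPLE_RATE:
            raise ValueError(
                f"expected {TARGET_SAMPLE_RATE} Hz, got {rate} Hz; resample first"
            )
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h", frames)  # signed 16-bit integers
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return [s / 32768.0 for s in samples]
```

The returned list of floats can be passed to a feature extractor or converted to a tensor or NumPy array as needed.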

How do I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key, sending an audio file or base64-encoded audio.
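Because the model is not yet live, the request format is not documented on this page. The sketch below only builds a request object and does not send it. The base URL, the /audio/transcriptions path (borrowed from OpenAI's convention), and the JSON field names are all hypothetical placeholders, and the eventual API may expect multipart file uploads instead of base64 JSON.

```python
import base64
import json
import urllib.request

# Hypothetical values: the real base URL, path, and field names are not
# documented on this page, and the model is not yet live.
BASE_URL = "https://api.gigarouter.example/v1"
MODEL_ID = "anuragshas/wav2vec2-large-xlsr-53-telugu"


def build_transcription_request(
    audio_bytes: bytes, api_key: str
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST carrying base64-encoded audio."""
    body = {
        "model": MODEL_ID,
        # Assumed field name for the base64-encoded audio payload.
        "audio": base64.b64encode(audio_bytes).decode("ascii"),
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/transcriptions",  # assumed path
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Once a real endpoint is published, the request could be sent with urllib.request.urlopen, after replacing the placeholder values with the documented ones.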

What is the model's word error rate (WER) on the test set?

The model achieves a WER of 44.98% on the OpenSLR SLR66 Telugu test set when used without a language model.

What data was the model trained on?

It was fine-tuned on 70% of the OpenSLR SLR66 Telugu dataset, which includes male and female speaker recordings under CC BY-SA 4.0.

Which base model is this fine-tuned from?

It is fine-tuned from facebook/wav2vec2-large-xlsr-53, which is released under Apache-2.0.

not yet live

We're benchmarking and onboarding Wav2Vec2 Large XLSR-53 Telugu as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
