Wav2Vec2 Large XLSR-53 Telugu
anuragshas/wav2vec2-large-xlsr-53-telugu
published Mar 2022 · updated Jul 2021
Wav2Vec2 Large XLSR-53 Telugu is an automatic speech recognition model fine-tuned on the OpenSLR SLR66 Telugu dataset for transcribing Telugu speech.
specs
| Task | Automatic Speech Recognition (ASR) |
| Architecture | Wav2Vec2 Large XLSR-53 |
| Language | Telugu |
| Training Dataset | OpenSLR SLR66 (70% split) |
about this model
anuragshas/wav2vec2-large-xlsr-53-telugu is an automatic speech recognition (ASR) model fine-tuned from facebook/wav2vec2-large-xlsr-53 for transcribing Telugu speech. The model was trained on 70% of the OpenSLR SLR66 Telugu dataset (CC BY-SA 4.0 licensed, containing male and female speaker recordings) and requires input audio sampled at 16 kHz.
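A minimal inference sketch, assuming the checkpoint loads with the standard `Wav2Vec2Processor` and `Wav2Vec2ForCTC` classes from Hugging Face `transformers`, and that `torch` and `torchaudio` are installed. The helper name `transcribe` is illustrative and not part of the model's published interface. Audio at any other rate is resampled to the required 16 kHz before decoding.

```python
MODEL_ID = "anuragshas/wav2vec2-large-xlsr-53-telugu"
TARGET_SAMPLE_RATE = 16_000  # the model requires 16 kHz input


def transcribe(audio_path: str) -> str:
    """Greedy CTC transcription of one audio file (hypothetical helper)."""
    # Heavy dependencies are imported inside the function so that defining
    # it does not require them or trigger a model download.
    import torch
    import torchaudio
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
    model.eval()

    waveform, sample_rate = torchaudio.load(audio_path)  # (channels, frames)
    if sample_rate != TARGET_SAMPLE_RATE:
        waveform = torchaudio.functional.resample(
            waveform, orig_freq=sample_rate, new_freq=TARGET_SAMPLE_RATE
        )
    speech = waveform.mean(dim=0).numpy()  # downmix to mono

    inputs = processor(
        speech,
        sampling_rate=TARGET_SAMPLE_RATE,
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_ids = torch.argmax(logits, dim=-1)  # greedy, no language model
    return processor.batch_decode(predicted_ids)[0]
```

Calling `transcribe("sample_telugu.wav")` (a hypothetical file path) would download the checkpoint on first use and return the decoded Telugu text.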
Performance
On the OpenSLR SLR66 test set, the model achieves a Word Error Rate (WER) of 44.98% when used without a language model. The evaluation applies a text normalizer that removes non-Telugu characters and punctuation, and lowercases the reference text.
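For context, WER is the word-level edit distance between hypothesis and reference divided by the number of reference words: WER = (S + D + I) / N, where S, D, and I count substitutions, deletions, and insertions. The sketch below is a plain-Python illustration of that definition, not the evaluation script used for the reported figure; the function name is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

A corpus-level score is normally computed by summing errors and reference words over all utterances before dividing, rather than averaging per-utterance rates.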
Architecture and Training
The model is fine-tuned from facebook/wav2vec2-large-xlsr-53 (Apache-2.0 licensed), a cross-lingual pretrained model that demonstrated a 72% relative phoneme error rate reduction on CommonVoice and 16% relative WER improvement on BABEL in the original XLSR-53 paper. Training used 70% of the OpenSLR SLR66 Telugu dataset.
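The source gives the 70% training share but not how the split was drawn. As an illustration only, a reproducible 70/30 partition of utterance IDs could look like the following; the function name, the seed, and the shuffle-then-cut strategy are assumptions, not the author's procedure.

```python
import random


def split_utterances(utterance_ids, train_percent=70, seed=0):
    """Return (train_ids, test_ids) using a seeded shuffle (hypothetical helper)."""
    ids = sorted(utterance_ids)       # fixed starting order for reproducibility
    random.Random(seed).shuffle(ids)  # local RNG, leaves global state untouched
    cut = len(ids) * train_percent // 100  # integer arithmetic avoids float rounding
    return ids[:cut], ids[cut:]
```

Splitting by utterance, as here, does not keep speakers separate between the two sets; a speaker-disjoint split would need speaker labels.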
Evaluation
On the held-out test split of the OpenSLR SLR66 Telugu dataset, the model achieves a Word Error Rate (WER) of 44.98% when used without a language model. Before scoring, the evaluation normalizes the text by removing punctuation and non-Telugu characters and by lowercasing.
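The normalization described above can be approximated as follows. This is a sketch under the assumption that "Telugu characters" means the Telugu Unicode block (U+0C00 to U+0C7F); the actual normalizer used for the reported WER may differ, and the function name is illustrative.

```python
import re

# Matches anything that is neither in the Telugu Unicode block nor whitespace.
_NON_TELUGU = re.compile(r"[^\u0C00-\u0C7F\s]")


def normalize_text(text: str) -> str:
    """Lowercase, drop punctuation and non-Telugu characters, tidy spaces."""
    text = text.lower()  # no effect on Telugu script, which has no letter case
    text = _NON_TELUGU.sub(" ", text)  # replace with a space so words stay apart
    return " ".join(text.split())      # collapse runs of whitespace
```

Applying the same function to both the reference and the model output keeps the comparison consistent.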
Dataset and License
The OpenSLR SLR66 Telugu dataset is licensed under CC BY-SA 4.0. The base model facebook/wav2vec2-large-xlsr-53 is released under Apache-2.0 and is described in the original XLSR-53 paper (arXiv:2006.13979).
best for
- Transcribing Telugu speech audio into text
- Building voice-enabled applications for Telugu (e.g., dictation, voice search)
- Extracting content from Telugu audio recordings
FAQ
Speech input must be sampled at 16 kHz; the model expects raw audio arrays (e.g., from torchaudio).
Use the OpenAI-compatible endpoint with your API key, sending an audio file or base64-encoded audio.
The model achieves a WER of 44.98% on the OpenSLR Telugu test set.
It was fine-tuned on 70% of the OpenSLR SLR66 Telugu dataset, which includes male and female speaker recordings under CC BY-SA 4.0.
It is fine-tuned from Facebook's wav2vec2-large-xlsr-53, which is released under Apache 2.0.