Wav2Vec2 Large XLSR-53 Telugu

anuragshas/wav2vec2-large-xlsr-53-telugu

published Mar 2022 · updated Jul 2021

Wav2Vec2 Large XLSR-53 Telugu is an automatic speech recognition model fine-tuned on the OpenSLR SLR66 Telugu dataset for transcribing Telugu speech.

status

coming soon

API providers

downloads / mo

2.8M

license

apache-2.0

specs

Task	Automatic Speech Recognition (ASR)
Architecture	Wav2Vec2 Large XLSR-53
Language	Telugu
Training Dataset	OpenSLR SLR66 (70% split)

about this model

anuragshas/wav2vec2-large-xlsr-53-telugu is an automatic speech recognition (ASR) model fine-tuned from facebook/wav2vec2-large-xlsr-53 for transcribing Telugu speech. The model was trained on 70% of the OpenSLR SLR66 Telugu dataset (CC BY-SA 4.0 licensed, containing male and female speaker recordings) and requires input audio sampled at 16 kHz.
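A minimal local-inference sketch, assuming the checkpoint loads with the Hugging Face `transformers` classes `Wav2Vec2Processor` and `Wav2Vec2ForCTC`, and that `torch` and `torchaudio` are installed. The helper name `transcribe` is illustrative and not part of the model card. Decoding is plain greedy CTC, with no language model.

```python
MODEL_ID = "anuragshas/wav2vec2-large-xlsr-53-telugu"
TARGET_SAMPLE_RATE = 16_000  # the model expects 16 kHz input


def transcribe(audio_path: str) -> str:
    """Transcribe one audio file with greedy CTC decoding (no language model)."""
    # Heavy dependencies are imported lazily so this module stays cheap to import.
    import torch
    import torchaudio
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
    model.eval()

    # torchaudio.load returns a (channels, samples) tensor and the file's sample rate.
    waveform, sample_rate = torchaudio.load(audio_path)
    waveform = waveform.mean(dim=0)  # downmix to mono
    if sample_rate != TARGET_SAMPLE_RATE:
        waveform = torchaudio.functional.resample(
            waveform, orig_freq=sample_rate, new_freq=TARGET_SAMPLE_RATE
        )

    inputs = processor(
        waveform.numpy(),
        sampling_rate=TARGET_SAMPLE_RATE,
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        logits = model(
            inputs.input_values,
            attention_mask=inputs.get("attention_mask"),
        ).logits

    # Greedy decoding: take the most likely token per frame, then collapse via CTC.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Calling `transcribe("clip.wav")` (a hypothetical file path) downloads the checkpoint on first use and returns the predicted Telugu text.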

Performance

On the OpenSLR SLR66 test set, the model achieves a Word Error Rate (WER) of 44.98% when used without a language model. The evaluation applies a text normalizer that removes non-Telugu characters and punctuation, and lowercases the reference text.
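For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The following is a minimal sketch of that definition in plain Python. It is not the evaluation script used for this model, and the function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current[j] = min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            )
        previous = current
    return previous[-1] / len(ref_words)
```

Under this definition, a WER of 44.98% corresponds to roughly 45 word-level edits per 100 reference words. Corpus-level WER is normally computed by summing edits and reference words over all utterances, not by averaging per-utterance scores.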

Architecture and Training

The model is fine-tuned from facebook/wav2vec2-large-xlsr-53 (Apache-2.0 licensed), a cross-lingual model pretrained on speech in 53 languages. In the original XLSR-53 paper, this pretraining approach gave a 72% relative phoneme error rate reduction on CommonVoice compared with the best previously known results, and a 16% relative WER improvement on BABEL compared with a comparable system. Fine-tuning used 70% of the OpenSLR SLR66 Telugu dataset.
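The figures quoted from the paper are relative, not absolute, changes. A short arithmetic sketch of the difference follows. The error rates used in the usage note are hypothetical values chosen only to illustrate the formula and are not results from the paper or from this model.

```python
def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction of the baseline error that was removed."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


def apply_relative_reduction(baseline_error: float, reduction: float) -> float:
    """Error rate that remains after a given relative reduction."""
    return baseline_error * (1.0 - reduction)
```

For example, going from a hypothetical 25.0% error rate to 7.0% is a 72% relative reduction but only an 18-point absolute reduction. A 16% relative improvement on a hypothetical 50.0% WER leaves 42.0%.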

Evaluation

On the held-out test split of the OpenSLR SLR66 Telugu dataset, the model achieves a Word Error Rate (WER) of 44.98% when used without a language model. Before scoring, the evaluation normalizes text by removing punctuation and non-Telugu characters and by lowercasing.
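The normalization step can be sketched as follows. The exact rules of the original normalizer are not given here, so this version makes two assumptions: "Telugu characters" means the Unicode Telugu block (U+0C00 to U+0C7F), and removed characters are replaced by a space so that adjacent words do not merge. The function name is illustrative.

```python
import re

# Assumption: anything outside the Telugu Unicode block (U+0C00 to U+0C7F)
# that is not whitespace counts as a non-Telugu character.
_NON_TELUGU = re.compile(r"[^\u0C00-\u0C7F\s]")
_WHITESPACE = re.compile(r"\s+")


def normalize_transcript(text: str) -> str:
    """Lowercase, drop non-Telugu characters and punctuation, collapse spaces."""
    text = text.lower()  # no effect on Telugu script, which has no letter case
    text = _NON_TELUGU.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()
```

Applying the same normalization to both reference and hypothesis before scoring keeps punctuation and stray Latin text from counting as word errors.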

Dataset and License

The OpenSLR SLR66 Telugu dataset is licensed under CC BY-SA 4.0. The base model facebook/wav2vec2-large-xlsr-53 is released under Apache-2.0. The pretraining approach is described in the original XLSR-53 paper (arXiv:2006.13979).

best for

· Transcribing Telugu speech audio into text
· Building voice-enabled applications for Telugu (e.g., dictation, voice search)
· Extracting content from Telugu audio recordings

FAQ

What audio input format does the model require?

Speech input must be sampled at 16 kHz. The model expects raw waveform arrays (for example, as loaded with torchaudio), so audio recorded at another rate must be resampled first.
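A quick pre-flight check can catch files at the wrong rate before they reach the model. This sketch uses only the Python standard library `wave` module, so it handles uncompressed PCM WAV files only. The function names are illustrative, and `write_silent_wav` exists solely to produce a test file.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # rate required by the model


def wav_sample_rate(path: str) -> int:
    """Return the sample rate stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, target: int = TARGET_SAMPLE_RATE) -> bool:
    """True if the file must be resampled before being passed to the model."""
    return wav_sample_rate(path) != target


def write_silent_wav(path: str, sample_rate: int, seconds: float = 0.1) -> None:
    """Write a short mono 16-bit silent clip, useful for trying the check."""
    frame_count = int(sample_rate * seconds)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * frame_count)
```

Files that fail the check can be resampled with a dedicated audio library before inference.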

How do I call this model via the gigarouter API?

This model is not yet live on the hosted API. Once it is, use the OpenAI-compatible endpoint with your API key and send either an audio file or base64-encoded audio.

What is the model's word error rate (WER) on the test set?

The model achieves a WER of 44.98% on the OpenSLR SLR66 Telugu test split when used without a language model.

What data was the model trained on?

It was fine-tuned on 70% of the OpenSLR SLR66 Telugu dataset, which includes male and female speaker recordings under CC BY-SA 4.0.

Which base model is this fine-tuned from?

It is fine-tuned from facebook/wav2vec2-large-xlsr-53, which is released under Apache-2.0.

not yet live

We're benchmarking and onboarding Wav2Vec2 Large XLSR-53 Telugu as a hosted, OpenAI-compatible API.

related speech-to-text models


speaker-diarization-3.1

wav2vec2-large-xlsr-53-japanese

6.1M dl/mo

wav2vec2-large-xlsr-53-polish

4.7M dl/mo

wav2vec2-large-xlsr-53-dutch

4.1M dl/mo