Wav2Vec2 XLS-R 300M Bengali
arijitx/wav2vec2-xls-r-300m-bengali
published Mar 2022 · updated Mar 2022
Wav2Vec2 XLS-R 300M Bengali is an automatic speech recognition (ASR) model for Bengali, fine-tuned from wav2vec2-xls-r-300m. On the OpenSLR SLR53 evaluation set it reaches a word error rate (WER) of 21.7% without a language model and 15.3% with a 5-gram language model.
specs
| Field | Value |
|---|---|
| Task | Automatic Speech Recognition (ASR) |
| Architecture | wav2vec2-xls-r-300m (fine-tuned) |
| Parameters | 300 million |
| License | Not specified for the fine-tuned model (base model: Apache-2.0) |
about this model
arijitx/wav2vec2-xls-r-300m-bengali is a speech recognition (ASR) model fine-tuned from facebook/wav2vec2-xls-r-300m on the OpenSLR SLR53 Bengali dataset. The base model was pretrained on 436k hours of unlabeled speech across 128 languages, providing a strong foundation for Bengali ASR.
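For local experimentation, the checkpoint can likely be loaded with the Hugging Face `transformers` ASR pipeline. The following is a minimal sketch, not an official recipe: it assumes the model is published on the Hub under the ID listed above, that `transformers` and `torch` are installed, and that `ffmpeg` is available so the pipeline can decode and resample an audio file. The `transcribe` helper name is illustrative.

```python
import sys

# Model ID as listed on this page; assumed to resolve on the Hugging Face Hub.
MODEL_ID = "arijitx/wav2vec2-xls-r-300m-bengali"


def transcribe(audio_path: str) -> str:
    """Transcribe one audio file and return the recognized text."""
    # Imported lazily so this module can be imported without the heavy dependency.
    from transformers import pipeline

    # The pipeline decodes the file and resamples it to the rate the
    # model's feature extractor expects.
    asr = pipeline("automatic-speech-recognition", model=MODEL_ID)
    result = asr(audio_path)
    return result["text"]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python transcribe.py path/to/bengali_clip.wav
    print(transcribe(sys.argv[1]))
```

Whether the pipeline applies the 5-gram language model depends on the files shipped with the checkpoint and on optional decoder packages being installed, so results may match either row of the evaluation table.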
Evaluation Results
Performance is measured on a held-out evaluation set comprising 5% of the total 10,935 samples (approximately 547 samples). Metrics are reported without and with a 5-gram language model.
| Condition | Word Error Rate (WER) | Character Error Rate (CER) |
|---|---|---|
| Without language model | 0.2173 | 0.0473 |
| With 5-gram language model | 0.1532 | 0.0341 |
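To make the two metrics concrete, here is a small illustrative sketch of how WER and CER are commonly computed from an edit distance, along with the relative improvement implied by the table. It is not the evaluation script used for this model: text normalization, whitespace handling, and the unit of a "character" (this sketch counts Unicode code points, not Bengali grapheme clusters) are assumptions and may differ from the original evaluation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over code points, spaces included."""
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error removed."""
    return (before - after) / before


# One substituted word out of three reference words.
print(round(wer("আমি ভাত খাই", "আমি ভাত খাব"), 3))  # 0.333

# Relative gains from adding the 5-gram language model (figures from the table).
print(f"{relative_reduction(0.2173, 0.1532):.1%}")  # 29.5%
print(f"{relative_reduction(0.0473, 0.0341):.1%}")  # 27.9%
```

In other words, the language model removes a little under a third of the word errors on this evaluation set.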
The language model was trained on 30 million sentences from the AI4Bharat IndicCorp Bengali corpus, which contains approximately 39.9 million sentences and 836 million tokens in total. Fine-tuning of the acoustic model was stopped after 180k steps.
The fine-tuned model does not specify a license; the base wav2vec2-xls-r-300m model is licensed under Apache-2.0.
best for
- Transcribing Bengali audio from meetings, lectures, or interviews
- Building voice-enabled Bengali applications like voice search or dictation
- Integrating into a speech-to-text pipeline with optional LM boosting for higher accuracy
FAQ
It is designed for transcribing Bengali speech into text, with optional integration of a 5-gram language model to reduce word error rate.
It has 300 million parameters, making it moderately sized; it is fine-tuned from Facebook's XLS-R cross-lingual model.
The base model is licensed under Apache-2.0. The fine-tuned model card does not specify a separate license, so the terms for the fine-tuned weights are not stated explicitly.
It takes Bengali speech audio as input and returns the transcribed text.
Once the model is hosted, call the gigarouter OpenAI-compatible endpoint with an API key: send the audio as input and receive the transcribed text in the response.
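As a rough illustration of what such a call could look like, the sketch below follows the usual OpenAI-style transcription request shape (a multipart POST to `/audio/transcriptions` with a `file` and a `model` field). Everything specific here is an assumption: the base URL is a placeholder, the model name is assumed to match the listing ID, the `GIGAROUTER_API_KEY` variable name is invented for the example, and the helper names are hypothetical. It also assumes the `requests` package is installed.

```python
import os

# Placeholder: the real base URL is not given on this page.
BASE_URL = "https://api.example.invalid/v1"
# Assumption: the hosted model name matches the listing ID.
MODEL_NAME = "arijitx/wav2vec2-xls-r-300m-bengali"


def build_request(base_url: str, api_key: str, model: str):
    """Return (url, headers, form fields) for an OpenAI-style transcription call."""
    url = base_url.rstrip("/") + "/audio/transcriptions"
    headers = {"Authorization": f"Bearer {api_key}"}
    form = {"model": model}
    return url, headers, form


def transcribe_remote(audio_path: str, api_key: str) -> str:
    """Upload an audio file and return the transcribed text."""
    import requests  # imported lazily; only needed when a request is sent

    url, headers, form = build_request(BASE_URL, api_key, MODEL_NAME)
    with open(audio_path, "rb") as audio_file:
        response = requests.post(
            url,
            headers=headers,
            data=form,
            files={"file": audio_file},
            timeout=60,
        )
    response.raise_for_status()
    # OpenAI-style transcription responses carry the transcript under "text".
    return response.json()["text"]


if __name__ == "__main__" and os.environ.get("GIGAROUTER_API_KEY"):
    # Hypothetical environment variable name, used only for this example.
    print(transcribe_remote("clip.wav", os.environ["GIGAROUTER_API_KEY"]))
```

Check the provider's documentation for the actual base URL, model name, and supported audio formats before relying on this shape.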
We're benchmarking and onboarding Wav2Vec2 XLS-R 300M Bengali as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.