Wav2Vec2 XLS-R 300M Bengali

arijitx/wav2vec2-xls-r-300m-bengali

published Mar 2022 · updated Mar 2022

Wav2Vec2 XLS-R 300M Bengali is an automatic speech recognition (ASR) model for Bengali, fine-tuned from the wav2vec2-xls-r-300m checkpoint. It achieves a word error rate of 21.7% without a language model and 15.3% with a 5-gram language model on its OpenSLR SLR53 evaluation set.

status

coming soon

API providers

downloads / mo

1.4M

license

apache-2.0

specs

Task	Automatic Speech Recognition (ASR)
Architecture	wav2vec2-xls-r-300m (fine-tuned)
Parameters	300 million
License	Apache-2.0 (base model)

about this model

arijitx/wav2vec2-xls-r-300m-bengali is a speech recognition (ASR) model fine-tuned from facebook/wav2vec2-xls-r-300m on the OpenSLR SLR53 Bengali dataset. The base model was pretrained on 436k hours of unlabeled speech across 128 languages, providing a strong foundation for Bengali ASR.
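For readers who want to try the checkpoint locally, the following is a minimal sketch that assumes the Hugging Face transformers library (with a PyTorch backend and ffmpeg for audio decoding) is installed and that the checkpoint can be downloaded under the ID above. The helper name transcribe is illustrative and does not come from the model card.

```python
MODEL_ID = "arijitx/wav2vec2-xls-r-300m-bengali"


def transcribe(audio_path, model_id=MODEL_ID):
    """Transcribe one audio file with the fine-tuned checkpoint.

    Assumption: `transformers` and a backend such as PyTorch are installed.
    The import is done lazily so this module can be imported without them.
    """
    from transformers import pipeline

    # Downloads the checkpoint on first use; decoding a file path needs ffmpeg.
    asr = pipeline("automatic-speech-recognition", model=model_id)
    return asr(audio_path)["text"]


# Example call (the file name is a placeholder, not a file shipped with the model):
# text = transcribe("sample_bengali.wav")
```

Whether the pipeline applies the 5-gram language model during decoding depends on which optional decoding packages are present in the environment, so results may match either row of the evaluation table.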

Evaluation Results

Performance is measured on a held-out evaluation set comprising 5% of the total 10,935 samples (approximately 547 samples). Metrics are reported without and with a 5-gram language model.

Condition	Word Error Rate (WER)	Character Error Rate (CER)
Without language model	0.2173	0.0473
With 5-gram language model	0.1532	0.0341
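As a reminder of what the two metrics in the table measure, here is a small, dependency-free sketch of WER and CER based on edit distance. It is not the evaluation script behind the reported numbers. In particular, it counts Unicode code points as characters, which is an assumption and may differ from how the model card tokenizes Bengali text.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over Unicode code points (an assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def relative_reduction(before, after):
    """Fraction by which an error rate drops, relative to the starting value."""
    return (before - after) / before
```

Applied to the table, relative_reduction(0.2173, 0.1532) is about 0.295 and relative_reduction(0.0473, 0.0341) is about 0.279, so the 5-gram language model removes roughly 29.5% of word errors and 27.9% of character errors in relative terms.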

The language model was trained on 30 million sentences from the AI4Bharat IndicCorp Bengali corpus, which contains approximately 39.9 million sentences and 836 million tokens in total. Fine-tuning of the acoustic model was stopped after 180k steps.

The fine-tuned model does not specify a license; the base wav2vec2-xls-r-300m model is licensed under Apache-2.0.

best for

·Transcribing Bengali audio from meetings, lectures, or interviews
·Building voice-enabled Bengali applications like voice search or dictation
·Integrating into a speech-to-text pipeline with optional LM boosting for higher accuracy

FAQ

What is this model best used for?

It is designed for transcribing Bengali speech into text, with optional integration of a 5-gram language model to reduce word error rate.

How does it compare in size to other ASR models?

It has 300 million parameters, which makes it the smallest of the XLS-R checkpoints (released in 300M, 1B, and 2B parameter sizes). It is fine-tuned from Facebook's XLS-R cross-lingual model.

What are the license terms?

The base model is licensed under Apache-2.0. The fine-tuned model card does not specify a separate license, so Apache-2.0 is listed here on the basis of the base model rather than an explicit statement from the model author.

What input/output format does it expect?

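It takes Bengali speech audio as input and returns the transcribed text. wav2vec 2.0 models, including the XLS-R base, operate on 16 kHz audio, so recordings at other sample rates need resampling first. Once the model is hosted, audio will be sent through the gigarouter OpenAI-compatible endpoint.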
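A quick pre-flight check on input files can catch format mismatches before transcription. The sketch below uses only the Python standard library and assumes uncompressed WAV input; the 16 kHz mono target reflects how wav2vec 2.0 models are trained, and whether a hosted endpoint resamples automatically is not stated here. The function name wav_problems is illustrative.

```python
import wave

TARGET_RATE = 16_000  # wav2vec 2.0 / XLS-R checkpoints are trained on 16 kHz audio


def wav_problems(path):
    """Return reasons a WAV file would need conversion (empty list if none)."""
    problems = []
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
    if rate != TARGET_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {TARGET_RATE} Hz")
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    return problems
```

An empty list means the file already matches the assumed format; otherwise the messages say what to convert with a tool such as ffmpeg or a resampling library.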

How can I call this model via the API?

Once the model is live, use the gigarouter OpenAI-compatible endpoint with an API key: send audio as input and receive the transcribed text in the response.
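Since the hosted endpoint is not live yet, the following is only a sketch of what a call could look like, assuming the endpoint follows the OpenAI audio transcription interface and that the openai Python package (version 1.x) is installed. The base URL and API key are left as parameters because neither is given on this page, and using the Hugging Face model ID as the hosted model name is an assumption.

```python
MODEL_ID = "arijitx/wav2vec2-xls-r-300m-bengali"  # assumed hosted model name


def transcribe_remote(audio_path, base_url, api_key, model=MODEL_ID):
    """Send one audio file to an OpenAI-compatible transcription endpoint.

    Assumptions: `base_url` and `api_key` come from the provider's own
    documentation, and the endpoint implements the audio transcription route.
    """
    from openai import OpenAI  # imported lazily; requires the `openai` package

    client = OpenAI(base_url=base_url, api_key=api_key)
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text
```

The actual model name, supported audio formats, and response fields should be taken from the provider's documentation once the model is available.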

not yet live

We're benchmarking and onboarding Wav2Vec2 XLS-R 300M Bengali as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related speech-to-text models

speaker-diarization-3.1

wav2vec2-large-xlsr-53-japanese

6.1M dl/mo

wav2vec2-large-xlsr-53-polish

4.7M dl/mo

wav2vec2-large-xlsr-53-dutch

4.1M dl/mo