Wav2Vec2 XLS-R 300M Mixed

mesolitica/wav2vec2-xls-r-300m-mixed

published Jun 2022 · updated Jun 2022

Wav2Vec2 XLS-R 300M Mixed is an automatic speech recognition model that transcribes audio in Malay, Singlish, and Mandarin.

status

coming soon

API providers

downloads / mo

1.8M

specs

Task	Automatic Speech Recognition (ASR)
Architecture	wav2vec 2.0
Parameters	300M
License	Apache-2.0

about this model

wav2vec2-xls-r-300m-mixed is an automatic speech recognition (ASR) model fine-tuned from Facebook’s XLS-R 300M checkpoint on a mixed dataset of Malay, Singlish, and Mandarin speech. The base XLS-R model uses the wav2vec 2.0 architecture, has 300 million parameters, is released under Apache-2.0, and was pretrained on 436,000 hours of unlabeled speech across 128 languages. This fine-tuned variant specializes in those three languages and is being onboarded to gigarouter as a managed, OpenAI-compatible API.

The model was trained on a single RTX 3090 Ti (24 GB VRAM) and evaluated on held-out sets (Malay: 765 utterances, Singlish: 3,579, Mandarin: 614). A language model (huseinzol05/language-model-bahasa-manglish-combined) is available to further reduce error rates through LM decoding.
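To make "LM decoding" concrete, here is a minimal sketch of the baseline it improves on. wav2vec 2.0 ASR fine-tunes typically emit one token prediction per audio frame through a CTC head; assuming this checkpoint follows that pattern, decoding without a language model amounts to collapsing repeated frame predictions and dropping blanks. LM decoding would instead rescore candidate transcripts with the language model, usually through beam search. The toy vocabulary, blank ID, and frame IDs below are made up for illustration and are not the model's real vocabulary.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse consecutive duplicate frame predictions, then drop CTC blanks."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_token[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary; the real checkpoint ships its own.
vocab = {0: "<blank>", 1: "s", 2: "a", 3: "y", 4: " "}

# One predicted token ID per audio frame (illustrative values).
frames = [1, 1, 0, 2, 2, 2, 0, 3, 0, 0, 2]

print(ctc_greedy_decode(frames, vocab))  # prints "saya"
```

The blank token is what lets CTC output a doubled letter: `[2, 0, 2]` decodes to "aa", while `[2, 2]` decodes to "a".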

Benchmark Results

Evaluation Set	CER	WER	CER (with LM)	WER (with LM)
Mixed	0.0481	0.1322	0.0412	0.0988
Malay	0.0516	0.1956	0.0392	0.1271
Singlish	0.0495	0.1276	0.0427	0.0968
Mandarin	0.0356	0.0799	0.0349	0.0754

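All metrics are reported on the evaluation set from the Malaya Speech STT preparation. Decoding with the language model lowers both CER and WER on every evaluation set, with the largest gain on Malay (WER 0.1956 to 0.1271) and the smallest on Mandarin (WER 0.0799 to 0.0754).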
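For readers unfamiliar with the metrics, the sketch below shows the standard definitions: WER is the word-level edit distance divided by the number of reference words, and CER is the same ratio at character level. It is an illustration only. The text normalization and aggregation used for the table above are not described here, so these functions are not guaranteed to reproduce the reported numbers, and the example sentences are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "saya suka makan nasi"
hypothesis = "saya suka makan nasik"
print(wer(reference, hypothesis))  # 1 wrong word out of 4
print(cer(reference, hypothesis))  # 1 extra character out of 20
```

A single extra character costs a full word error, which is why WER in the table is consistently higher than CER.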

best for

·Transcribing Malay conversational audio
·Transcribing Singlish (Singapore English mixed with Chinese dialects) speech
·Transcribing Mandarin Chinese speech

FAQ

What languages does this model support?

It supports Malay, Singlish, and Mandarin Chinese.

What input format does the API expect?

Audio files in common formats (e.g., WAV, MP3) as a binary upload or base64-encoded string.
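As a rough illustration of both input forms, the sketch below builds a small WAV file in memory and base64-encodes it using only the Python standard library. The 16 kHz mono format is an assumption based on wav2vec 2.0 models being pretrained on 16 kHz audio; whether the API resamples other rates, and which request field would carry the base64 string, is not specified here.

```python
import base64
import io
import wave


def make_silent_wav(seconds=1.0, sample_rate=16000):
    """Return the bytes of a mono, 16-bit PCM WAV file containing silence."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)  # assumed 16 kHz input
        wav_file.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


# Binary form: raw file bytes, suitable for a file upload.
wav_bytes = make_silent_wav()

# Text form: the same bytes as a base64-encoded ASCII string.
audio_b64 = base64.b64encode(wav_bytes).decode("ascii")
```

In practice `wav_bytes` would come from reading a real recording, for example `open(path, "rb").read()`.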

Does the model use a language model for decoding?

Yes, it can optionally use an external language model (LM) to improve accuracy; LM-enhanced metrics are provided for each language.

How do I call this model via the gigarouter API?

Once the model is live, call the OpenAI-compatible endpoint with your API key and set the model name to wav2vec2-xls-r-300m-mixed.
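A minimal sketch of what such a request could look like, built with the Python standard library and not sent anywhere. It assumes the endpoint mirrors OpenAI's audio transcription API (a multipart POST to `/audio/transcriptions` with `file` and `model` fields). The base URL is a placeholder, the helper name `build_transcription_request` is hypothetical, and the actual gigarouter path and parameters may differ once the model is available.

```python
import urllib.request
import uuid


def build_transcription_request(base_url, api_key, audio_bytes,
                                model="wav2vec2-xls-r-300m-mixed",
                                filename="audio.wav"):
    """Build (but do not send) a multipart transcription request."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        # Path assumed from OpenAI's audio transcription API.
        url=base_url.rstrip("/") + "/audio/transcriptions",
        data=head + audio_bytes + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Placeholder base URL and key; replace with real values.
request = build_transcription_request(
    "https://api.example.invalid/v1", "YOUR_API_KEY", b"RIFF...audio bytes..."
)
# urllib.request.urlopen(request) would send it.
```

An OpenAI-compatible client library pointed at the same base URL should produce an equivalent request.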

What are the reported error rates on the evaluation set?

Mixed evaluation: CER 4.8%, WER 13.2%; with LM: CER 4.1%, WER 9.9%. Breakdown per language is available in the model card.
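To put the LM gains in relative terms, the short calculation below takes the WER columns from the Benchmark Results table and computes how much of the baseline error the language model removes. The helper name `relative_reduction` is illustrative.

```python
# WER without and with LM decoding, copied from the Benchmark Results table.
wer_results = {
    "Mixed": (0.1322, 0.0988),
    "Malay": (0.1956, 0.1271),
    "Singlish": (0.1276, 0.0968),
    "Mandarin": (0.0799, 0.0754),
}


def relative_reduction(before, after):
    """Fraction of the baseline error removed, e.g. 0.25 means 25% fewer errors."""
    return (before - after) / before


for name, (baseline, with_lm) in wer_results.items():
    print(f"{name}: {relative_reduction(baseline, with_lm):.1%}")
# Mixed: 25.3%
# Malay: 35.0%
# Singlish: 24.1%
# Mandarin: 5.6%
```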

not yet live

We're benchmarking and onboarding Wav2Vec2 XLS-R 300M Mixed as a hosted, OpenAI-compatible API.

related speech-to-text models

speaker-diarization-3.1

wav2vec2-large-xlsr-53-japanese

6.1M dl/mo

wav2vec2-large-xlsr-53-polish

4.7M dl/mo

wav2vec2-large-xlsr-53-dutch

4.1M dl/mo