Wav2Vec2 Indonesian Javanese Sundanese

indonesian-nlp/wav2vec2-indonesian-javanese-sundanese

published Mar 2022 · updated Aug 2022

Wav2Vec2 Indonesian Javanese Sundanese is a multilingual automatic speech recognition model that transcribes audio in Indonesian, Javanese, and Sundanese.

status

coming soon

API providers

downloads / mo

4.1M

license

apache-2.0

specs

Task	Automatic Speech Recognition (ASR)
Architecture	Wav2Vec2 Large (XLSR-53)
Languages	Indonesian, Javanese, Sundanese
Training Data	Common Voice (Indonesian), SLR41 (Javanese), SLR44 (Sundanese)
WER (Indonesian, without LM)	11.57%
WER (Indonesian, with LM)	4.27%

about this model

This model is an automatic speech recognition (ASR) model fine-tuned from facebook/wav2vec2-large-xlsr-53 to support three major languages of Indonesia: Indonesian, Javanese, and Sundanese. It was trained on the Indonesian Common Voice dataset, the "High-quality TTS data for Javanese" corpus (SLR41), and the "High-quality TTS data for Sundanese" corpus (SLR44). The model accepts speech input sampled at 16 kHz and outputs transcribed text.
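For local experimentation, the 16 kHz speech-in, text-out flow described above can be sketched with the Hugging Face `transformers` library. This is a minimal sketch, not the authors' published usage: it assumes `torch`, `transformers`, and `librosa` are installed, that the checkpoint loads with `Wav2Vec2Processor` and `Wav2Vec2ForCTC`, and it decodes greedily without a language model. The function name `transcribe` is illustrative.

```python
MODEL_ID = "indonesian-nlp/wav2vec2-indonesian-javanese-sundanese"
TARGET_SAMPLE_RATE = 16_000  # the model expects 16 kHz input


def transcribe(audio_path: str) -> str:
    """Transcribe one audio file with greedy CTC decoding (no language model)."""
    # Imported lazily so the heavy dependencies load only when transcription runs.
    import librosa
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    # Assumption: the checkpoint ships a tokenizer and feature extractor
    # that Wav2Vec2Processor can load.
    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
    model.eval()

    # librosa resamples to the requested rate and downmixes to mono by default.
    speech, _ = librosa.load(audio_path, sr=TARGET_SAMPLE_RATE)

    inputs = processor(speech, sampling_rate=TARGET_SAMPLE_RATE, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Greedy decoding: take the most likely token per frame, then let the
    # tokenizer collapse repeats and drop CTC blank tokens.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Calling `transcribe("example.wav")` (a hypothetical file name) downloads the checkpoint on first use and returns the transcript as a string.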

Key strengths include robust multilingual performance across languages that are not natively covered by most single-language ASR systems. Monolingual baselines trained on just one language degrade severely on the other two (e.g., an Indonesian-only model achieves 78.06% WER on Javanese and 64.04% WER on Sundanese), whereas this multilingual model delivers competitive word error rates (WER) on all three.

Notable benchmark results (the Indonesian figures are on the Common Voice 6.1 test set):

Without a language model (300-epoch multilingual model): Indonesian WER 11.57%, Javanese WER 16.57%, Sundanese WER 6.72%.
With a KenLM language model trained on Common Voice text: Indonesian WER drops to 4.27%; with a Wikipedia-based LM it reaches 5.15%.
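For comparison, Google Speech-to-Text scores a WER of 9.22% on the same Indonesian test set, so this model outperforms it when paired with the Common Voice LM (4.27%) or the Wikipedia-based LM (5.15%), but not without an LM (11.57%).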
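The WER figures above count word-level substitutions, deletions, and insertions against the reference transcript, divided by the number of reference words. A minimal sketch of that calculation in plain Python follows; the function name and the example sentences are illustrative and are not drawn from the evaluation sets, and published evaluations may apply text normalization that this sketch omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of four reference words.
print(word_error_rate("saya pergi ke pasar", "saya pergi pasar"))  # 0.25
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.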

The following figure compares models by WER on the Indonesian test set (without language model):

The models in this comparison were trained for 200 epochs, with one exception: the reported 11.57% Indonesian WER comes from the 300-epoch multilingual model. The 4.27% WER with an LM was achieved using dataset version v7.

best for

·Transcribing Indonesian, Javanese, or Sundanese speech from audio files
·Enabling voice-to-text for regional language applications in Indonesia
·Building multilingual voice assistants and call center automation

FAQ

What languages does this ASR model support?

It supports Indonesian, Javanese, and Sundanese.

What is the word error rate on Indonesian speech?

Without a language model, the WER is 11.57% on the Common Voice test set; with a KenLM language model, it improves to 4.27%.

What audio input format is required?

The model expects mono audio sampled at 16 kHz.
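Before sending audio to the model, it can help to confirm that a WAV file already meets this requirement. The sketch below uses only Python's standard `wave` module; the helper names `check_wav_format` and `write_silent_wav` are illustrative, and the second exists only to produce test files.

```python
import wave

TARGET_SAMPLE_RATE = 16_000
TARGET_CHANNELS = 1


def check_wav_format(path: str) -> list:
    """Return a list of format problems; an empty list means mono 16 kHz."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != TARGET_CHANNELS:
            problems.append(f"expected 1 channel, found {wav.getnchannels()}")
        if wav.getframerate() != TARGET_SAMPLE_RATE:
            problems.append(f"expected 16000 Hz, found {wav.getframerate()} Hz")
    return problems


def write_silent_wav(path: str, sample_rate: int, channels: int, seconds: float = 0.1) -> None:
    """Write a short 16-bit PCM file of silence, useful for trying the check."""
    n_frames = int(sample_rate * seconds)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * channels * n_frames)
```

A file that fails the check needs to be resampled or downmixed before transcription; the `wave` module only reads the header values and does not convert audio.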

How can I use this model via the gigarouter API?

Send requests to the OpenAI-compatible endpoint with your API key, providing the audio as a base64-encoded WAV file or a URL.
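Since the hosted endpoint is not yet live, its request schema is not documented here. The sketch below shows only the base64-encoding step for a WAV file and wraps it in a JSON body whose field names (`model`, `audio`) are assumptions made for illustration, not the service's confirmed API. No endpoint URL is included for the same reason.

```python
import base64
import json

MODEL_ID = "indonesian-nlp/wav2vec2-indonesian-javanese-sundanese"


def build_transcription_payload(wav_bytes: bytes) -> str:
    """Return a JSON request body carrying the audio as base64 text."""
    audio_b64 = base64.b64encode(wav_bytes).decode("ascii")
    # Hypothetical field names: the real schema may differ once the API is live.
    payload = {"model": MODEL_ID, "audio": audio_b64}
    return json.dumps(payload)
```

The resulting string would be sent as the body of an authenticated POST request once the endpoint and its exact fields are published.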

What is the underlying architecture of the model?

It is fine-tuned from Facebook's Wav2Vec2 Large XLSR-53, a self-supervised speech representation model.

not yet live

We're benchmarking and onboarding Wav2Vec2 Indonesian Javanese Sundanese as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related speech-to-text models

speaker-diarization-3.1

wav2vec2-large-xlsr-53-japanese

6.1M dl/mo

wav2vec2-large-xlsr-53-polish

4.7M dl/mo

wav2vec2-large-xlsr-53-dutch

4.1M dl/mo