Wav2Vec2 Indonesian Javanese Sundanese
indonesian-nlp/wav2vec2-indonesian-javanese-sundanese
published Mar 2022 · updated Aug 2022
Wav2Vec2 Indonesian Javanese Sundanese is a multilingual automatic speech recognition model that transcribes audio in Indonesian, Javanese, and Sundanese.
specs
| Task | Automatic Speech Recognition (ASR) |
| Architecture | Wav2Vec2 Large (XLSR-53) |
| Languages | Indonesian, Javanese, Sundanese |
| Training Data | Common Voice (Indonesian), SLR41 (Javanese), SLR44 (Sundanese) |
| WER (Indonesian, without LM) | 11.57% |
| WER (Indonesian, with LM) | 4.27% |
about this model
This model is an automatic speech recognition (ASR) model fine-tuned from facebook/wav2vec2-large-xlsr-53 to support three major languages of Indonesia: Indonesian, Javanese, and Sundanese. It was trained on the Indonesian Common Voice dataset, High-quality TTS data for Javanese (SLR41), and High-quality TTS data for Sundanese (SLR44). The model accepts speech input sampled at 16 kHz and outputs transcribed text.
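As a rough illustration of the 16 kHz speech-in, text-out flow described above, the following sketch runs the checkpoint locally with Hugging Face Transformers. It assumes the `transformers`, `torch`, and `librosa` packages are installed and that the repository loads with `Wav2Vec2Processor` and `Wav2Vec2ForCTC`; the `transcribe` helper is a hypothetical name introduced here. Decoding is greedy, so it corresponds to the "without LM" setting rather than the KenLM results.

```python
MODEL_ID = "indonesian-nlp/wav2vec2-indonesian-javanese-sundanese"
TARGET_SAMPLE_RATE = 16_000  # the model expects speech sampled at 16 kHz


def transcribe(audio_path: str) -> str:
    """Greedy (no language model) transcription of a single audio file.

    Hypothetical helper for illustration; not part of any library.
    """
    # Heavy dependencies are imported lazily so this module can be imported
    # even where they are not installed.
    import librosa
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
    model.eval()

    # librosa resamples to the requested rate and downmixes to mono.
    speech, _ = librosa.load(audio_path, sr=TARGET_SAMPLE_RATE, mono=True)

    inputs = processor(
        speech, sampling_rate=TARGET_SAMPLE_RATE, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**inputs).logits

    # Pick the most likely token per frame; batch_decode collapses the
    # frame-level predictions into text.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Calling `transcribe("sample.wav")` (a placeholder path) would download the checkpoint on first use and return the transcript as a string.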
The model's key strength is consistent performance across three languages that single-language ASR systems rarely cover together. Monolingual baselines trained on just one language degrade severely on the other two (e.g., an Indonesian-only model achieves 78.06% WER on Javanese and 64.04% WER on Sundanese), whereas this multilingual model delivers competitive word error rates (WER) on all three.
Notable benchmark results on the Indonesian Common Voice 6.1 test set:
- Without a language model (300-epoch multilingual model): Indonesian WER 11.57%, Javanese WER 16.57%, Sundanese WER 6.72%.
- With a KenLM language model trained on Common Voice text: Indonesian WER drops to 4.27%; with a Wikipedia-based LM it reaches 5.15%.
- For comparison, Google Speech-to-Text reports a WER of 9.22% on the same Indonesian test set, so the model with an LM (4.27%) outperforms it.
[Figure: WER by model on the Indonesian test set, without a language model]
All models in this comparison were trained for 200 epochs, whereas the reported 11.57% WER on Indonesian comes from the 300-epoch multilingual model. The 4.27% WER with an LM was achieved using dataset version v7.
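For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference and the hypothesis, divided by the number of reference words. The sketch below is a minimal, generic implementation for illustration, not the evaluation script behind the figures above; the function names and the sample sentence are assumptions made here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers `baseline`."""
    return (baseline - improved) / baseline
```

For example, `word_error_rate("saya pergi ke pasar", "saya pergi pasar")` counts one deleted word out of four reference words, giving 0.25. Applying `relative_reduction(11.57, 4.27)` to the Indonesian figures gives roughly 0.63, meaning the KenLM language model removes about 63% of the errors made without one.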
best for
- Transcribing Indonesian, Javanese, or Sundanese speech from audio files
- Enabling voice-to-text for regional language applications in Indonesia
- Building multilingual voice assistants and call center automation
FAQ
It supports Indonesian, Javanese, and Sundanese.
Without a language model, the Indonesian WER is 11.57% on the Common Voice test set; with a KenLM language model, it improves to 4.27%.
The model expects mono audio sampled at 16 kHz.
Send requests to the OpenAI-compatible endpoint with your API key, providing the audio as a base64-encoded WAV file or a URL.
It is fine-tuned from Facebook's Wav2Vec2 Large XLSR-53, a self-supervised speech representation model.
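The input requirements in the answers above (mono, 16 kHz, base64-encoded WAV) can be checked locally before any upload. The sketch below uses only the Python standard library; `encode_wav_for_upload` is a hypothetical helper name, and because the request body format is not specified here, the sketch stops at producing the base64 string rather than guessing at endpoint paths or field names.

```python
import base64
import wave

REQUIRED_SAMPLE_RATE = 16_000  # Hz, as stated in the FAQ
REQUIRED_CHANNELS = 1          # mono


def encode_wav_for_upload(path: str) -> str:
    """Validate a WAV file's format and return its bytes as base64 text.

    Hypothetical helper for illustration; raises ValueError if the file
    is not mono 16 kHz.
    """
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()

    if channels != REQUIRED_CHANNELS or sample_rate != REQUIRED_SAMPLE_RATE:
        raise ValueError(
            f"expected mono {REQUIRED_SAMPLE_RATE} Hz audio, "
            f"got {channels} channel(s) at {sample_rate} Hz"
        )

    # Encode the whole file, header included, so the receiver gets a
    # complete WAV container.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
```

Files that fail the check would need to be downmixed and resampled first, for example with an audio tool or library of your choice.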
Wav2Vec2 Indonesian Javanese Sundanese is being benchmarked and onboarded as a hosted, OpenAI-compatible API.