Wav2Vec2 Large XLSR-53 Thai
airesearch/wav2vec2-large-xlsr-53-th
published Mar 2022 · updated Mar 2022
Wav2Vec2 Large XLSR-53 Thai is an automatic speech recognition model fine-tuned for Thai on Common Voice 7.0.
specs
| Task | Automatic Speech Recognition (ASR) |
| Architecture | Wav2Vec2 large XLSR-53 |
| License | CC-BY-SA-4.0 |
about this model
airesearch/wav2vec2-large-xlsr-53-th is an automatic speech recognition (ASR) model fine-tuned from Facebook's wav2vec2-large-xlsr-53 on Thai Common Voice 7.0. It transcribes Thai speech into text and is optimized for low character error rate (CER) and word error rate (WER).
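As a rough illustration of how a checkpoint like this is typically used for transcription, here is a minimal sketch with the Hugging Face `transformers` library. It assumes `torch` and `transformers` are installed and that the audio has already been loaded as 16 kHz mono float arrays. The function name `transcribe` and the greedy decoding step are illustrative choices, not necessarily the authors' exact pipeline.

```python
MODEL_ID = "airesearch/wav2vec2-large-xlsr-53-th"
TARGET_SAMPLE_RATE = 16_000  # the model expects 16 kHz mono input


def transcribe(waveforms):
    """Transcribe a list of 1-D float arrays sampled at 16 kHz (mono).

    Heavy dependencies are imported lazily, so this module can be imported
    even when torch or transformers is not installed.
    """
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
    model.eval()

    # padding=True pads shorter clips so the batch forms a single tensor.
    inputs = processor(
        waveforms,
        sampling_rate=TARGET_SAMPLE_RATE,
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        logits = model(**inputs).logits

    # Greedy decoding (an assumption of this sketch): pick the most likely
    # token per frame and let the processor collapse repeats and blanks.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)
```

Calling `transcribe([clip_a, clip_b])` returns one string per clip. The first call downloads the model weights, so it needs network access.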
The model was trained on 133 validated hours of Thai speech from Common Voice Corpus 7.0 on a single V100 GPU, and the checkpoint with the lowest validation loss was selected. It is benchmarked against several commercial APIs on the Common Voice 7.0 test set. Because Thai is written without spaces between words, WER depends on the word tokenizer, so it is reported under both PyThaiNLP and deepcut tokenization.
Benchmark Results (Common Voice 7 test set)
| System | WER % (PyThaiNLP 2.3.1) | WER % (deepcut) | CER % |
|---|---|---|---|
| Kaldi from scratch (baseline) | 23.04 | — | 7.57 |
| Ours without spell correction | 13.63 | 8.15 | 2.81 |
| Ours with spell correction | 18.00 | 14.17 | 5.23 |
| Google Web Speech API | 13.71 | 10.86 | 7.36 |
| Microsoft Bing Speech API | 12.58 | 9.62 | 5.02 |
| Amazon Transcribe | 21.86 | 14.49 | 7.08 |
| NECTEC AI for Thai Partii API | 20.11 | 15.52 | 9.55 |
Note: Commercial APIs were not fine-tuned on Common Voice 7.0 data.
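For readers unfamiliar with the metrics in the table, the sketch below shows the usual definition of WER and CER: edit distance between reference and hypothesis, divided by reference length. It is a plain-Python illustration, not the evaluation script behind the reported numbers. Word lists are assumed to come from a tokenizer such as PyThaiNLP or deepcut, and dropping spaces before computing CER is an assumption of this sketch.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (all edits cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance over reference length, as a percentage."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(ref_words, hyp_words):
    """Word error rate over pre-tokenized word lists."""
    return error_rate(list(ref_words), list(hyp_words))


def cer(ref_text, hyp_text):
    """Character error rate; spaces are removed first (assumption)."""
    return error_rate(list(ref_text.replace(" ", "")),
                      list(hyp_text.replace(" ", "")))
```

The same transcript can score differently under `wer` depending on how it was segmented into words, which is why the table reports one WER column per tokenizer.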
Additional Tokenization Benchmark (robust-speech-event)
| Tokenization | WER (PyThaiNLP 2.3.1) | WER (deepcut) | SER | CER |
|---|---|---|---|---|
| Only Tokenization | 0.9524% | 2.5316% | 1.2346% | 0.1623% |
These figures come from a separate evaluation (robust-speech-event) and are not directly comparable with the table above. The primary benchmark against commercial APIs is the more realistic guide to production performance.
Without spell correction, the model has the lowest CER (2.81%) and the lowest deepcut WER (8.15%) of all systems in the benchmark. Its PyThaiNLP WER (13.63%) is second only to Microsoft Bing Speech API (12.58%). It is licensed under CC-BY-SA 4.0.
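The ranking can be checked directly from the benchmark table. The snippet below copies the reported values into a dictionary and picks the best (lowest) system per metric. The key names `wer_pythainlp`, `wer_deepcut`, and `cer` are labels chosen for this sketch.

```python
# Values copied from the Common Voice 7 benchmark table (percentages).
# None marks a result that is not reported.
RESULTS = {
    "Kaldi from scratch (baseline)": {"wer_pythainlp": 23.04, "wer_deepcut": None, "cer": 7.57},
    "Ours without spell correction": {"wer_pythainlp": 13.63, "wer_deepcut": 8.15, "cer": 2.81},
    "Ours with spell correction": {"wer_pythainlp": 18.00, "wer_deepcut": 14.17, "cer": 5.23},
    "Google Web Speech API": {"wer_pythainlp": 13.71, "wer_deepcut": 10.86, "cer": 7.36},
    "Microsoft Bing Speech API": {"wer_pythainlp": 12.58, "wer_deepcut": 9.62, "cer": 5.02},
    "Amazon Transcribe": {"wer_pythainlp": 21.86, "wer_deepcut": 14.49, "cer": 7.08},
    "NECTEC AI for Thai Partii API": {"wer_pythainlp": 20.11, "wer_deepcut": 15.52, "cer": 9.55},
}


def best_system(metric):
    """Return the system with the lowest reported value for a metric."""
    scored = {
        name: row[metric]
        for name, row in RESULTS.items()
        if row[metric] is not None
    }
    return min(scored, key=scored.get)


for metric in ("wer_pythainlp", "wer_deepcut", "cer"):
    print(metric, "->", best_system(metric))
```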
best for
- Transcribing Thai speech audio into text
- Building Thai voice-enabled applications
- Benchmarking Thai ASR with the Common Voice 7.0 test set
FAQ
Which languages does it support?
It is fine-tuned for Thai speech recognition only.
What audio input does it expect?
Input audio must be 16 kHz mono. It can be prepared with the Wav2Vec2 processor, with padding enabled when batching clips of different lengths.
How accurate is it?
On the Common Voice 7.0 test set, it achieves a CER of 2.81% and a WER (deepcut) of 8.15%, competitive with the Google and Microsoft APIs.
What license is it released under?
It is released under CC-BY-SA-4.0.
How do I call it through the API?
Send a POST request to the gigarouter OpenAI-compatible endpoint with your API key, providing audio data and specifying the model.
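Since the model expects 16 kHz mono audio, it can help to check a file before sending it anywhere. The sketch below uses only Python's standard `wave` module and so covers uncompressed WAV files only. The helper name `check_wav` is hypothetical.

```python
import wave


def check_wav(path, expected_rate=16_000, expected_channels=1):
    """Return a list of format problems; an empty list means the file matches."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()

    problems = []
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channel(s), expected {expected_channels}")
    return problems
```

A file that fails either check needs resampling or downmixing first, for example with an audio library of your choice.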
We're benchmarking and onboarding Wav2Vec2 Large XLSR-53 Thai as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.