
Wav2Vec2 Large XLSR-53 Thai

airesearch/wav2vec2-large-xlsr-53-th

published Mar 2022 · updated Mar 2022

Wav2Vec2 Large XLSR-53 Thai is an automatic speech recognition model fine-tuned for Thai on Common Voice 7.0.

status: coming soon
API providers: 0
downloads / mo: 1.7M
license: cc-by-sa-4.0

specs

Task: Automatic Speech Recognition (ASR)
Architecture: Wav2Vec2 large XLSR-53
License: CC-BY-SA-4.0

about this model

airesearch/wav2vec2-large-xlsr-53-th is an automatic speech recognition (ASR) model fine-tuned from Facebook's wav2vec2-large-xlsr-53 on Thai Common Voice 7.0. It transcribes Thai speech into text and is optimized for low character error rate (CER) and word error rate (WER).

The model was trained on 133 validated hours of Thai speech from Common Voice Corpus 7.0 on a single V100 GPU, and the checkpoint with the lowest validation loss was selected. Its performance is benchmarked on the test set against several commercial APIs, with WER reported under both PyThaiNLP and deepcut tokenization.

Benchmark Results (Common Voice 7 test set)

| System | WER, PyThaiNLP 2.3.1 (%) | WER, deepcut (%) | CER (%) |
| --- | --- | --- | --- |
| Kaldi from scratch (baseline) | 23.04 | - | 7.57 |
| Ours without spell correction | 13.63 | 8.15 | 2.81 |
| Ours with spell correction | 18.00 | 14.17 | 5.23 |
| Google Web Speech API | 13.71 | 10.86 | 7.36 |
| Microsoft Bing Speech API | 12.58 | 9.62 | 5.02 |
| Amazon Transcribe | 21.86 | 14.49 | 7.08 |
| NECTEC AI for Thai Partii API | 20.11 | 15.52 | 9.55 |

Note: Commercial APIs were not fine-tuned on Common Voice 7.0 data.
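
To make the WER and CER columns concrete, here is a minimal sketch of how such scores are commonly computed: Levenshtein edit distance divided by the reference length, over tokens for WER and over characters for CER. It is not the evaluation script behind the table. The function names and the hand-segmented Thai example are illustrative assumptions; in practice the token lists would come from a tokenizer such as PyThaiNLP or deepcut, and counting characters as Unicode code points is also an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(ref_tokens, hyp_tokens):
    """Word error rate over pre-tokenized sequences (tokenizer chosen by the caller)."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(ref_text, hyp_text):
    """Character error rate; characters are Unicode code points (an assumption)."""
    if not ref_text:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_text, hyp_text) / len(ref_text)


# Hand-made segmentations for illustration only, not real tokenizer output.
ref_a, hyp_a = ["ฉัน", "กิน", "ข้าว"], ["ฉัน", "กิน", "ขาว"]
ref_b, hyp_b = ["ฉัน", "กินข้าว"], ["ฉัน", "กินขาว"]

print(round(wer(ref_a, hyp_a), 3))  # one wrong token out of three
print(round(wer(ref_b, hyp_b), 3))  # one wrong token out of two
print(round(cer("".join(ref_a), "".join(hyp_a)), 3))  # one missing mark out of ten
```

Because Thai is written without spaces between words, the same pair of sentences can score differently under different segmentations, which is why the table reports WER separately for each tokenizer while CER stays the same.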

Additional Tokenization Benchmark (robust-speech-event)

| Tokenization | WER (PyThaiNLP 2.3.1) | WER (deepcut) | SER | CER |
| --- | --- | --- | --- | --- |
| Only Tokenization | 0.9524% | 2.5316% | 1.2346% | 0.1623% |

These figures come from a separate evaluation in which only tokenization was applied, so they are not directly comparable to the table above; the primary benchmark against commercial APIs is the more realistic guide to production performance.

Without spell correction, the model achieves the lowest CER (2.81%) and the lowest deepcut WER (8.15%) of all systems tested. Its PyThaiNLP WER (13.63%) is slightly better than Google's (13.71%) and behind Microsoft's (12.58%). It is licensed under CC-BY-SA 4.0.

FAQ

What language does this model support?

It is fine-tuned for Thai speech recognition only.

What audio format does the model expect?

Input audio must be mono and sampled at 16 kHz. It can then be preprocessed with the Wav2Vec2 processor, with padding enabled.
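
As a rough sketch of that requirement, the snippet below checks a WAV file's header against the 16 kHz mono spec using only the standard library, then shows how a batch of waveforms might be padded with the Hugging Face Transformers processor. The function names are hypothetical, the input is assumed to be a WAV file, and the second function assumes `transformers` and `torch` are installed and that the checkpoint can be downloaded.

```python
import wave

TARGET_RATE = 16_000  # Hz, the sampling rate the model expects
MODEL_ID = "airesearch/wav2vec2-large-xlsr-53-th"


def wav_matches_spec(path: str) -> bool:
    """Return True if the WAV file at `path` is mono and sampled at 16 kHz."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels() == 1 and wav.getframerate() == TARGET_RATE


def prepare_batch(waveforms):
    """Pad a list of 1-D float waveforms (16 kHz, mono) into model inputs.

    Imported lazily because transformers is a heavy, optional dependency.
    """
    from transformers import Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    # padding=True pads shorter clips to the longest clip in the batch.
    return processor(
        waveforms,
        sampling_rate=TARGET_RATE,
        return_tensors="pt",
        padding=True,
    )
```

Files that fail the check would need to be downmixed and resampled before being passed to the processor.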

How does this model compare to commercial Thai ASR APIs?

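On the Common Voice 7.0 test set, without spell correction, it achieves a CER of 2.81% and a deepcut WER of 8.15%, both lower than the Google (7.36%, 10.86%) and Microsoft (5.02%, 9.62%) APIs. Note that the commercial APIs were not fine-tuned on Common Voice 7.0 data.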

What is the license of this model?

It is released under CC-BY-SA-4.0.

How can I use this model via the gigarouter API?

Send a POST request to the gigarouter OpenAI-compatible endpoint with your API key, providing audio data and specifying the model.
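
The request described above might look roughly like the following sketch, which only builds the request and does not send it, since the model is listed as coming soon. Everything specific here is an assumption: the base URL is a placeholder on a reserved `.example` domain, the `/audio/transcriptions` path and the `model` and `file` form fields follow the usual OpenAI-style transcription convention, and the function name is hypothetical. Check the gigarouter documentation for the real values once the model is live.

```python
import urllib.request
import uuid

BASE_URL = "https://api.gigarouter.example/v1"  # placeholder, not the real endpoint
MODEL_ID = "airesearch/wav2vec2-large-xlsr-53-th"


def build_transcription_request(audio_bytes: bytes, api_key: str,
                                filename: str = "speech.wav") -> urllib.request.Request:
    """Build (but do not send) a multipart POST carrying the audio and model name."""
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode()
    body = b"\r\n".join([
        delimiter,
        b'Content-Disposition: form-data; name="model"',
        b"",
        MODEL_ID.encode(),
        delimiter,
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: audio/wav",
        b"",
        audio_bytes,
        delimiter + b"--",  # closing boundary
        b"",
    ])
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/transcriptions",  # assumed OpenAI-style path
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Once a real endpoint exists, the returned object could be passed to `urllib.request.urlopen` to perform the call.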

not yet live

We're benchmarking and onboarding Wav2Vec2 Large XLSR-53 Thai as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
