Wav2Vec2 Large XLSR-53 Thai

airesearch/wav2vec2-large-xlsr-53-th

published Mar 2022 · updated Mar 2022

Wav2Vec2 Large XLSR-53 Thai is an automatic speech recognition model fine-tuned for Thai on Common Voice 7.0.

status	coming soon
API providers	none listed
downloads / mo	1.7M
license	cc-by-sa-4.0

specs

Task	Automatic Speech Recognition (ASR)
Architecture	Wav2Vec2 large XLSR-53
License	CC-BY-SA-4.0

about this model

airesearch/wav2vec2-large-xlsr-53-th is an automatic speech recognition (ASR) model fine-tuned from Facebook's wav2vec2-large-xlsr-53 on the Thai portion of Common Voice 7.0. It transcribes Thai speech into text and is evaluated by character error rate (CER) and word error rate (WER).

The model was trained on 133 validated hours of Thai speech from Common Voice Corpus 7.0 on a single V100 GPU, and the checkpoint with the lowest validation loss was selected. Because Thai is written without spaces between words, WER depends on the word segmenter, so performance on the test set is benchmarked against several commercial APIs using both PyThaiNLP and deepcut tokenization.

Benchmark Results (Common Voice 7.0 test set)

System	WER % (PyThaiNLP 2.3.1)	WER % (deepcut)	CER %
Kaldi from scratch (baseline)	23.04	—	7.57
Ours without spell correction	13.63	8.15	2.81
Ours with spell correction	18.00	14.17	5.23
Google Web Speech API	13.71	10.86	7.36
Microsoft Bing Speech API	12.58	9.62	5.02
Amazon Transcribe	21.86	14.49	7.08
NECTEC AI for Thai Partii API	20.11	15.52	9.55

Note: Commercial APIs were not fine-tuned on Common Voice 7.0 data.
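
The two WER columns appear to differ only in how text is split into words, while CER is computed over characters. As a rough illustration of how such metrics are commonly derived, here is a minimal sketch based on Levenshtein edit distance. It assumes the text has already been tokenized (by PyThaiNLP, deepcut, or any other segmenter). The names `edit_distance`, `wer`, and `cer` are illustrative and not taken from the model's own evaluation script, and computing CER over the joined tokens is an assumption of this sketch.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of tokens or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref_tokens, hyp_tokens):
    """Word error rate: token-level edits divided by reference length."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(ref_tokens, hyp_tokens):
    """Character error rate over the joined tokens (an assumption of this sketch)."""
    ref_text, hyp_text = "".join(ref_tokens), "".join(hyp_tokens)
    if not ref_text:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_text, hyp_text) / len(ref_text)


# Hypothetical pre-tokenized example: one wrong character in the last word.
reference = ["ฉัน", "กิน", "ข้าว"]
hypothesis = ["ฉัน", "กิน", "ข้าง"]
print(f"WER = {wer(reference, hypothesis):.4f}, CER = {cer(reference, hypothesis):.4f}")
```

A single wrong character counts as one full word error but only one character error, which is why WER is typically higher than CER for the same system, and why a different segmenter can change WER while leaving CER untouched.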

Additional Tokenization Benchmark (robust-speech-event)

Tokenization	WER (PyThaiNLP 2.3.1)	WER (deepcut)	SER	CER
Only Tokenization	0.9524%	2.5316%	1.2346%	0.1623%

These figures come from the robust-speech-event evaluation with only tokenization applied. For production use, the primary benchmark against commercial APIs above is the more realistic indicator of performance.

Without spell correction, the model achieves the lowest CER (2.81%) and the lowest deepcut WER (8.15%) of all systems tested. Its PyThaiNLP WER (13.63%) is second to Microsoft's (12.58%) and just ahead of Google's (13.71%). It is licensed under CC-BY-SA 4.0.

best for

· Transcribing Thai speech audio into text
· Building Thai voice-enabled applications
· Benchmarking Thai ASR on the Common Voice 7.0 test set

FAQ

What language does this model support?

It is fine-tuned for Thai speech recognition only.

What audio format does the model expect?

Input audio must be mono and sampled at 16 kHz. It can then be passed through the Wav2Vec2 processor with padding enabled.
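
As a hedged sketch of what that looks like in practice, the snippet below checks a WAV file against the 16 kHz mono requirement and then runs local inference. It assumes 16-bit PCM WAV input and that `torch` and `transformers` are installed. The helper names `read_wav_16k_mono` and `transcribe` are illustrative, and rejecting non-conforming files (rather than resampling them) is a choice made for this sketch only.

```python
import struct
import wave

TARGET_RATE = 16_000  # Hz, the sample rate named in the answer above


def read_wav_16k_mono(path):
    """Return samples as floats in [-1, 1); raise ValueError on a format mismatch."""
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != TARGET_RATE:
            raise ValueError(f"expected {TARGET_RATE} Hz, got {wav.getframerate()} Hz")
        if wav.getnchannels() != 1:
            raise ValueError(f"expected mono, got {wav.getnchannels()} channels")
        if wav.getsampwidth() != 2:
            # Assumption of this sketch: only 16-bit PCM is handled.
            raise ValueError("expected 16-bit PCM samples")
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]


def transcribe(path, model_id="airesearch/wav2vec2-large-xlsr-53-th"):
    """Greedy CTC decoding of one file; downloads the model on first use."""
    # Heavy dependencies are imported lazily so the WAV check works without them.
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id)

    speech = read_wav_16k_mono(path)
    inputs = processor(speech, sampling_rate=TARGET_RATE,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Calling `transcribe("clip.wav")` (a hypothetical file name) would return the decoded text. Audio at another sample rate or with more than one channel would need to be converted first.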

How does this model compare to commercial Thai ASR APIs?

On the Common Voice 7.0 test set, without spell correction, it achieves a CER of 2.81% and a WER (deepcut) of 8.15%. Both are lower than the figures for Google (7.36% CER, 10.86% WER) and Microsoft (5.02% CER, 9.62% WER).

What is the license of this model?

It is released under CC-BY-SA-4.0.

How can I use this model via the gigarouter API?

Once the model is live, send a POST request to the gigarouter OpenAI-compatible endpoint with your API key, providing the audio data and specifying the model.
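
Since the model is not yet hosted, the following is only a sketch of how such a request might be assembled. It assumes the endpoint will follow the OpenAI transcription convention (a multipart POST to `/audio/transcriptions` with `file` and `model` fields). The base URL `https://api.example.invalid/v1` is a placeholder, not a real gigarouter address, and `build_transcription_request` is a hypothetical helper name.

```python
import urllib.request
import uuid


def build_transcription_request(base_url, api_key, audio_bytes, filename, model):
    """Build (but do not send) a multipart/form-data transcription request."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        # Assumed path, borrowed from the OpenAI transcription API convention.
        url=base_url.rstrip("/") + "/audio/transcriptions",
        data=head + audio_bytes + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Placeholder values for illustration only; nothing is sent over the network.
request = build_transcription_request(
    base_url="https://api.example.invalid/v1",  # hypothetical base URL
    api_key="YOUR_API_KEY",
    audio_bytes=b"",  # raw bytes of a 16 kHz mono WAV file would go here
    filename="clip.wav",
    model="airesearch/wav2vec2-large-xlsr-53-th",
)
print(request.get_method(), request.full_url)
```

Once a real endpoint is published, the request could be sent with `urllib.request.urlopen(request)`. The actual URL, field names, and response format should be taken from the provider's documentation.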

not yet live

We're benchmarking and onboarding Wav2Vec2 Large XLSR-53 Thai as a hosted, OpenAI-compatible API.

related speech-to-text models

speaker-diarization-3.1

wav2vec2-large-xlsr-53-japanese

6.1M dl/mo

wav2vec2-large-xlsr-53-polish

4.7M dl/mo

wav2vec2-large-xlsr-53-dutch

4.1M dl/mo