Wav2Vec2 XLSR-53 Large Chinese
jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn
published Mar 2022 · updated Dec 2022
XLSR-53 Large Chinese is an automatic speech recognition model fine-tuned on Chinese speech datasets for transcribing audio to text.
specs
| Spec | Value |
|---|---|
| Task | Automatic Speech Recognition (ASR) |
| Architecture | Wav2Vec2 Large XLSR-53 |
| Training Datasets | Common Voice 6.1 (zh-CN), CSS10, ST-CMDS |
| Input Sampling Rate | 16 kHz |
| License | Apache 2.0 (base model) |
about this model
jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn is an automatic speech recognition (ASR) model fine-tuned from Facebook’s Wav2Vec2-XLSR-53 large checkpoint for Chinese (zh-CN) speech. It was trained on the train and validation splits of Common Voice 6.1, CSS10, and ST-CMDS, and expects 16 kHz mono audio input.
The model achieves a Word Error Rate (WER) of 82.37% and a Character Error Rate (CER) of 19.03% on the Common Voice zh-CN test set. The following table compares it to a sibling Chinese XLSR-53 model:
| Model | WER | CER |
|---|---|---|
| jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn | 82.37% | 19.03% |
| ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt | 84.01% | 20.95% |
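The gap between the two metrics is easier to read with their definitions in hand. The sketch below is illustrative only and is not the evaluation script behind the figures above. It assumes WER is the word-level edit distance over whitespace-split tokens and CER is the character-level edit distance with whitespace removed, each divided by the reference length. The function names and the example sentence are hypothetical. It points at one plausible reason the WER is so much higher than the CER: written Chinese is usually not whitespace-segmented, so a single wrong character can make the whole sentence count as one wrong "word".

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate over whitespace-split tokens (assumed definition)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate with whitespace removed (assumed definition)."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# Hypothetical example: one wrong character in an unsegmented sentence.
reference = "今天天气很好"
hypothesis = "今天天汽很好"
print(f"WER: {wer(reference, hypothesis):.2%}")  # the sentence is a single 'word'
print(f"CER: {cer(reference, hypothesis):.2%}")  # one of six characters differs
```

In this toy case the sentence scores a WER of 100% but a CER of roughly 17%, which is why CER is usually the more informative of the two numbers for unsegmented Chinese text.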
The base XLSR-53 model is released under the Apache-2.0 license. This Chinese fine-tune is part of a larger multilingual suite covering 18 languages, and its training repository (wav2vec2-sprint) is now deprecated in favor of the HuggingSound library. When hosted via gigarouter’s API, no local installation is required; the model is served as a managed, OpenAI-compatible endpoint.
best for
- Transcribing Mandarin Chinese speech from audio files
- Building Chinese voice-controlled applications
- Generating captions for Chinese audiobooks and podcasts
FAQ
On the Common Voice zh-CN test set, the model reports a Word Error Rate of 82.37% and a Character Error Rate of 19.03%.
Input audio must be sampled at 16 kHz. Any format convertible to a 16 kHz waveform is supported.
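As a quick pre-flight check, the following minimal sketch uses Python's standard-library `wave` module to report whether an uncompressed WAV file already matches the 16 kHz mono input the model expects. The function name `describe_wav_mismatches` and the constants are hypothetical, and the sketch only inspects the file header. It does not resample or downmix anything, and other containers (MP3, FLAC, and so on) would need a separate audio library.

```python
import wave

EXPECTED_RATE_HZ = 16_000  # sampling rate the model expects
EXPECTED_CHANNELS = 1      # mono


def describe_wav_mismatches(path):
    """Return a list of reasons the WAV file at `path` is not 16 kHz mono.

    An empty list means the header matches the expected input format.
    """
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()

    problems = []
    if rate != EXPECTED_RATE_HZ:
        problems.append(
            f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ} Hz"
        )
    if channels != EXPECTED_CHANNELS:
        problems.append(
            f"file has {channels} channels, expected {EXPECTED_CHANNELS}"
        )
    return problems
```

Calling `describe_wav_mismatches("sample.wav")` on a 44.1 kHz stereo recording would return two messages, signalling that the audio should be resampled and downmixed before transcription.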
Send requests to the gigarouter OpenAI-compatible endpoint with your API key. Refer to the gigarouter documentation for the exact URL and request format.
The HuggingSound library supports language model boosted decoding (e.g., KenshoLMDecoder), which may improve results beyond the baseline WER/CER.
The model is derived from facebook/wav2vec2-large-xlsr-53, which is released under Apache 2.0.
Wav2Vec2 XLSR-53 Large Chinese is currently being benchmarked and onboarded as a hosted, OpenAI-compatible API.