Wav2Vec2 XLSR-53 Large Chinese

jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn

published Mar 2022 · updated Dec 2022

XLSR-53 Large Chinese is an automatic speech recognition model fine-tuned on Chinese speech datasets for transcribing audio to text.

status

coming soon

API providers

downloads / mo

1.5M

license

apache-2.0

specs

Task	Automatic Speech Recognition (ASR)
Architecture	Wav2Vec2 Large XLSR-53
Training Datasets	Common Voice 6.1 (zh-CN), CSS10, ST-CMDS
Input Sampling Rate	16 kHz
License	Apache 2.0 (base model)

about this model

jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn is an automatic speech recognition (ASR) model fine-tuned from Facebook’s Wav2Vec2-XLSR-53 large checkpoint for Chinese (zh-CN) speech. It was trained on the train and validation splits of Common Voice 6.1, CSS10, and ST-CMDS, and expects 16 kHz mono audio input.

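The model achieves a Word Error Rate (WER) of 82.37% and a Character Error Rate (CER) of 19.03% on the Common Voice zh-CN test set (lower is better for both). Because written Chinese does not separate words with spaces, CER is generally the more informative of the two figures. The following table compares it to a sibling Chinese XLSR-53 model: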

Model	WER	CER
jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn	82.37%	19.03%
ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt	84.01%	20.95%
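To make the two metrics concrete, here is a minimal sketch of how WER and CER can be computed from an edit distance. It is not the evaluation script behind the table; the function names and the choice to ignore spaces when computing CER are assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference item
                curr[j - 1] + 1,     # insert a hypothesis item
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edits divided by reference characters (spaces ignored)."""
    ref = list(reference.replace(" ", ""))
    hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


# One wrong character in a six-character sentence written without spaces.
reference = "今天天气很好"
hypothesis = "今天天汽很好"
print(f"CER = {cer(reference, hypothesis):.3f}")
print(f"WER = {wer(reference, hypothesis):.3f}")
```

With unsegmented text the whole sentence counts as a single "word", so one wrong character already makes that word wrong. This is one reason a WER far above the CER is unsurprising for Chinese.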

The base XLSR-53 model is released under the Apache-2.0 license. This Chinese fine-tune is part of a larger multilingual suite covering 18 languages, and its training repository (wav2vec2-sprint) is now deprecated in favor of the HuggingSound library. Once the model is hosted via gigarouter’s API, no local installation will be required; it will be served as a managed, OpenAI-compatible endpoint.
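For local use in the meantime, the HuggingSound library mentioned above can load the checkpoint directly. The sketch below assumes `huggingsound` is installed (`pip install huggingsound`) and that the weights can be downloaded from the Hugging Face Hub; `transcribe_files` is a hypothetical helper name, and the import is deferred so the module loads without the dependency.

```python
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn"


def transcribe_files(audio_paths, model_id=MODEL_ID):
    """Transcribe audio files locally and return one text string per file."""
    # Deferred import: requires the huggingsound package and downloads the
    # model weights the first time the model is constructed.
    from huggingsound import SpeechRecognitionModel

    model = SpeechRecognitionModel(model_id)
    results = model.transcribe(audio_paths)
    # Each result is a dict; the recognized text is under "transcription".
    return [result["transcription"] for result in results]
```

A call such as `transcribe_files(["/path/to/file.wav"])` (a placeholder path) would return a list with one transcription per input file.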

best for

·Transcribing Mandarin Chinese speech from audio files
·Building Chinese voice-controlled applications
·Generating captions for Chinese audiobooks and podcasts

FAQ

What is the WER and CER of this model on Common Voice zh-CN test?

Word Error Rate: 82.37%, Character Error Rate: 19.03%.

What audio format and sampling rate does the model require?

Input audio must be 16 kHz mono. Audio in any other format or sampling rate can be used as long as it is first converted to a 16 kHz mono waveform.
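A quick way to check whether a WAV file already meets this requirement is to read its header. The sketch below uses only the Python standard library; `is_model_ready` and `write_silence` are hypothetical helper names, and the second one exists only to produce files for the demonstration.

```python
import os
import tempfile
import wave

TARGET_RATE = 16_000  # sampling rate the model expects


def is_model_ready(path):
    """Return True if the WAV file at `path` is already 16 kHz mono."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == TARGET_RATE and wav.getnchannels() == 1


def write_silence(path, rate, channels, seconds=1):
    """Write a silent 16-bit PCM WAV file, used here only to demonstrate the check."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * channels * seconds)


# Demonstration with two generated files in a temporary directory.
workdir = tempfile.mkdtemp()
ok_path = os.path.join(workdir, "ok.wav")
cd_path = os.path.join(workdir, "cd_quality.wav")
write_silence(ok_path, rate=16_000, channels=1)
write_silence(cd_path, rate=44_100, channels=2)
print(is_model_ready(ok_path), is_model_ready(cd_path))
```

Files that fail the check need resampling and downmixing before transcription. One common option is `librosa.load(path, sr=16_000, mono=True)`, which returns the converted waveform together with its sampling rate.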

How can I use this model via the gigarouter API?

Once the model is live, send requests to the gigarouter OpenAI-compatible endpoint with your API key. Refer to the gigarouter documentation for the exact URL and request format.
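As a rough sketch of what an OpenAI-style transcription request looks like, the code below builds (but does not send) a multipart POST with the standard library. The base URL is a placeholder, the model identifier is assumed to match the repository name, and the `/audio/transcriptions` path with `model` and `file` form fields follows the OpenAI convention rather than anything confirmed for gigarouter.

```python
import urllib.request
import uuid

# Placeholder values: the real base URL and model identifier must come from
# the gigarouter documentation; neither is stated on this page.
BASE_URL = "https://api.example.invalid/v1"
MODEL_NAME = "jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn"


def build_transcription_request(audio_bytes, filename, api_key,
                                base_url=BASE_URL, model=MODEL_NAME):
    """Build, without sending, a multipart POST in the OpenAI transcription style."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Form field naming the model to use.
        (f"--{boundary}\r\n"
         'Content-Disposition: form-data; name="model"\r\n\r\n'
         f"{model}\r\n").encode("utf-8"),
        # Form field carrying the raw audio file.
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode("utf-8"),
        audio_bytes,
        # Closing boundary that ends the multipart body.
        f"\r\n--{boundary}--\r\n".encode("utf-8"),
    ])
    return urllib.request.Request(
        url=f"{base_url}/audio/transcriptions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. In the OpenAI convention the response is JSON with the transcript in a `text` field, but that should be confirmed against the provider's documentation.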

Is a language model available to improve accuracy?

The HuggingSound library supports language model boosted decoding (e.g., KenshoLMDecoder), which may improve results beyond the baseline WER/CER.
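A sketch of LM-boosted decoding, following the pattern shown in the HuggingSound README as best understood here. The argument names should be checked against the installed version, the KenLM file paths are hypothetical, and `transcribe_with_lm` is an illustrative helper name. This decoder additionally needs `pyctcdecode` installed.

```python
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn"


def transcribe_with_lm(audio_paths, lm_path, unigrams_path, model_id=MODEL_ID):
    """Transcribe with a KenLM-backed decoder instead of the default decoding."""
    # Deferred import: requires huggingsound and pyctcdecode to be installed.
    from huggingsound import KenshoLMDecoder, SpeechRecognitionModel

    model = SpeechRecognitionModel(model_id)
    # The decoder is built from the model's token set plus a KenLM file
    # (arpa or binary) and a unigram list; both paths are supplied by the caller.
    decoder = KenshoLMDecoder(
        model.token_set, lm_path=lm_path, unigrams_path=unigrams_path
    )
    results = model.transcribe(audio_paths, decoder=decoder)
    return [result["transcription"] for result in results]
```

Whether the language model actually lowers WER/CER depends on how well it matches the target domain, so it is worth scoring both decoders on the same held-out audio.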

What license does this model use?

The model is listed under Apache 2.0. It is derived from facebook/wav2vec2-large-xlsr-53, which is also released under Apache 2.0.

not yet live

We're benchmarking and onboarding Wav2Vec2 XLSR-53 Large Chinese as a hosted, OpenAI-compatible API.

related speech-to-text models

speaker-diarization-3.1

wav2vec2-large-xlsr-53-japanese

6.1M dl/mo

wav2vec2-large-xlsr-53-polish

4.7M dl/mo

wav2vec2-large-xlsr-53-dutch

4.1M dl/mo