Wav2Vec2 Large XLSR-53 Japanese

jonatasgrosman/wav2vec2-large-xlsr-53-japanese

published Mar 2022 · updated Dec 2022

Wav2Vec2 Large XLSR-53 Japanese is a speech recognition model fine-tuned from Facebook's wav2vec2-large-xlsr-53 on Japanese datasets including Common Voice 6.1, CSS10, and JSUT.

status

coming soon

API providers

downloads / mo

6.1M

license

apache-2.0

specs

Task	Automatic Speech Recognition (ASR)
Architecture	Wav2Vec2-Large-XLSR-53
License	Apache 2.0
Language	Japanese

about this model

jonatasgrosman/wav2vec2-large-xlsr-53-japanese is an automatic speech recognition (ASR) model that transcribes Japanese speech into text. It was fine-tuned from facebook/wav2vec2-large-xlsr-53 using the train and validation splits of the Common Voice 6.1, CSS10, and JSUT datasets. Input audio must be sampled at 16 kHz. The model is released under the Apache-2.0 license and has a registered DOI: 10.57967/hf/3568.
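For local transcription, one option is the HuggingSound library. The following is a minimal sketch, assuming the third-party `huggingsound` package is installed and that `SpeechRecognitionModel.transcribe` returns one dict per file with a "transcription" key. The helper name `transcribe_files` is hypothetical and only illustrates the flow.

```python
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-japanese"


def transcribe_files(audio_paths, model_id=MODEL_ID):
    """Return one transcription string per audio path.

    Assumes the third-party `huggingsound` package is installed; the model
    weights are fetched from the Hugging Face Hub when the model is first loaded.
    """
    if not audio_paths:
        # Nothing to transcribe, so skip loading the model entirely.
        return []

    # Imported lazily so this module can be imported without the dependency.
    from huggingsound import SpeechRecognitionModel

    model = SpeechRecognitionModel(model_id)
    results = model.transcribe(list(audio_paths))
    # Assumption: each result is a dict carrying a "transcription" key.
    return [result["transcription"] for result in results]
```

A call such as `transcribe_files(["/path/to/sample.wav"])` (the path is a placeholder) would return a list with one string per input file.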

Evaluation Results

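The model was evaluated on the Japanese test split of Common Voice 6.1 (evaluation date: 2021-05-10). Word Error Rate (WER) and Character Error Rate (CER) are reported below alongside results for other publicly available Japanese XLSR-53 fine-tuned models. WER can exceed 100% because inserted words count as errors against the length of the reference; written Japanese is usually not segmented with spaces, so CER is the more informative metric here.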

Model	WER	CER
jonatasgrosman/wav2vec2-large-xlsr-53-japanese	81.80%	20.16%
vumichien/wav2vec2-large-xlsr-japanese	1108.86%	23.40%
qqhann/w2v_hf_jsut_xlsr53	1012.18%	70.77%
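To make the two metrics concrete, here is a minimal sketch of WER and CER as edit distance divided by reference length. It is not the evaluation script used for the table above. Splitting words on whitespace and ignoring whitespace for CER are assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes a non-empty, whitespace-separated reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate with whitespace ignored (an assumption)."""
    ref_chars = "".join(reference.split())
    hyp_chars = "".join(hypothesis.split())
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

With the reference "こんにちは" and the hypothesis "こん にち は", `wer` returns 3.0 (300%) because the single reference word is replaced and two extra words are inserted, while `cer` returns 0.0 because the characters are identical. This is one way a model can show a very high WER next to a moderate CER.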

Additional Context

This model is part of a family of 15 XLSR-53 fine-tuned models covering Arabic, Chinese, Dutch, Finnish, French, German, Greek, Hungarian, Italian, Japanese, Persian, Polish, Portuguese, Russian, and Spanish. The training and evaluation scripts are available in the wav2vec2-sprint repository (now deprecated in favor of the HuggingSound library). Once it is available as a hosted API on gigarouter, the model can be used without local setup or dependency management.

best for

·Transcribing Japanese audio recordings
·Subtitling Japanese videos
·Building voice interfaces for Japanese applications

FAQ

What audio format does the model require?

The model expects mono speech audio sampled at 16 kHz.
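A quick way to check a file before sending it is to read its header. The sketch below uses Python's standard `wave` module, so it only handles uncompressed PCM WAV files. The function name `wav_format_problems` is hypothetical.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # 16 kHz, as stated for this model
TARGET_CHANNELS = 1          # mono


def wav_format_problems(path):
    """Return a list of format problems; an empty list means the file matches."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()

    problems = []
    if sample_rate != TARGET_SAMPLE_RATE:
        problems.append(
            f"sample rate is {sample_rate} Hz, expected {TARGET_SAMPLE_RATE} Hz"
        )
    if channels != TARGET_CHANNELS:
        problems.append(f"{channels} channels, expected {TARGET_CHANNELS} (mono)")
    return problems
```

Files that fail the check need to be resampled or downmixed with an audio tool of your choice before transcription.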

How do I call this model via the gigarouter API?

Once the model is live, call the gigarouter OpenAI-compatible endpoint with your API key and pass the audio data in the request.
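The request shape is not documented on this page, so the following is only a sketch of what a call might look like if the endpoint follows OpenAI's audio transcription convention (a multipart POST to `/audio/transcriptions` with `model` and `file` fields). The base URL is a placeholder, and the model identifier and the helper name `build_transcription_request` are assumptions. The function builds the request without sending it.

```python
import urllib.request
import uuid

# Hypothetical placeholder; substitute the base URL from the gigarouter docs.
BASE_URL = "https://gigarouter.example/v1"
# Assumption: the hosted model uses the same identifier as on this page.
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-japanese"


def build_transcription_request(wav_bytes, api_key, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a multipart/form-data transcription request."""
    boundary = uuid.uuid4().hex
    delimiter = f"--{boundary}".encode()
    body = b"\r\n".join([
        delimiter,
        b'Content-Disposition: form-data; name="model"',
        b"",
        model.encode(),
        delimiter,
        b'Content-Disposition: form-data; name="file"; filename="audio.wav"',
        b"Content-Type: audio/wav",
        b"",
        wav_bytes,
        delimiter + b"--",  # closing delimiter
        b"",
    ])
    return urllib.request.Request(
        url=f"{base_url}/audio/transcriptions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Check the path and field names against the gigarouter documentation before relying on them.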

What word error rate (WER) does this model achieve?

On the Common Voice Japanese test set, it reports a WER of 81.80% and CER of 20.16% (evaluated May 2021).

What is the license of this model?

It is released under the Apache 2.0 license.

Does this model support languages other than Japanese?

No, this model is fine-tuned exclusively for Japanese speech recognition.

not yet live

Wav2Vec2 Large XLSR-53 Japanese is being benchmarked and onboarded as a hosted, OpenAI-compatible API.

related speech-to-text models

speaker-diarization-3.1

wav2vec2-large-xlsr-53-polish

4.7M dl/mo

wav2vec2-large-xlsr-53-dutch

4.1M dl/mo

wav2vec2-indonesian-javanese-sundanese

4.1M dl/mo