Wav2Vec2 XLS-R 300M Hebrew
imvladikon/wav2vec2-xls-r-300m-hebrew
published Mar 2022 · updated Sep 2023
Wav2Vec2 XLS-R 300M Hebrew is an automatic speech recognition model fine-tuned from Facebook's XLS-R 300M for transcribing Hebrew speech.
specs
| Task | Automatic Speech Recognition |
| Architecture | Wav2Vec2-XLS-R |
| Parameters | 300M |
| License | Apache-2.0 |
about this model
imvladikon/wav2vec2-xls-r-300m-hebrew is an automatic speech recognition (ASR) model fine-tuned from facebook/wav2vec2-xls-r-300m, a 300-million-parameter multilingual model pretrained on 436k hours of unlabeled speech across 128 languages (including Hebrew) from VoxPopuli, MLS, CommonVoice, BABEL, and VoxLingua107. At release, the XLS-R model family set a new state of the art on VoxLingua107 language identification and lowered error rates on ASR benchmarks by 14-34% relative on average compared with prior work.
This Hebrew ASR model was fine-tuned in two stages: first on a small curated dataset (28 hours, 20,306 samples), then on a larger mixed dataset (69 hours, 90,777 samples) that included weakly labeled data from a previously trained model. After the second training stage, the model achieved:
- WER 0.1697 on the small dataset
- WER 0.2318 and loss 0.4502 on the large dataset
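For reference, word error rate (WER) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words, so a WER of 0.1697 corresponds to roughly 17 errors per 100 reference words. The following is a minimal sketch of the metric, assuming whitespace tokenization and no text normalization; the evaluation script actually used for this model is not described here, and `word_error_rate` is an illustrative name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One missing word out of a two-word reference gives a WER of 0.5.
print(word_error_rate("שלום עולם", "שלום"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.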
Training ran on 2 GPUs with gradient accumulation (total batch size 64), a learning rate of 0.0003, a linear scheduler with 1000 warmup steps, and native AMP mixed precision. The base model is released under the Apache-2.0 license.
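The training hyperparameters above can be read as simple arithmetic: the total batch size is the per-device batch size times the number of devices times the gradient accumulation steps, and a linear scheduler ramps the learning rate up during warmup and then decays it linearly to zero. The sketch below illustrates this under stated assumptions. Only the device count, total batch size, learning rate, and warmup length come from the description above; the per-device batch size of 8, the derived accumulation count, and `TOTAL_STEPS` are hypothetical values chosen for illustration.

```python
NUM_DEVICES = 2          # stated above
TOTAL_BATCH_SIZE = 64    # stated above
PEAK_LR = 3e-4           # stated above (0.0003)
WARMUP_STEPS = 1000      # stated above

# Assumptions for illustration only: neither value is given in the text above.
PER_DEVICE_BATCH_SIZE = 8
TOTAL_STEPS = 10_000

# Accumulation steps needed to reach the stated total batch size.
GRAD_ACCUM_STEPS = TOTAL_BATCH_SIZE // (NUM_DEVICES * PER_DEVICE_BATCH_SIZE)


def effective_batch_size(per_device: int, devices: int, accum_steps: int) -> int:
    """Number of samples contributing to each optimizer step."""
    return per_device * devices * accum_steps


def linear_schedule_lr(step: int, total_steps: int,
                       peak_lr: float = PEAK_LR,
                       warmup_steps: int = WARMUP_STEPS) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    print(effective_batch_size(PER_DEVICE_BATCH_SIZE, NUM_DEVICES, GRAD_ACCUM_STEPS))
    for step in (0, 500, 1000, 5500, 10_000):
        print(step, linear_schedule_lr(step, TOTAL_STEPS))
```

Other splits of the total batch size (for example 16 per device with 2 accumulation steps) satisfy the same arithmetic, so the split shown here should not be taken as the one actually used.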
Training data
| Dataset | Size (GB) | Samples | Duration (hrs) |
|---|---|---|---|
| Small train | 4.19 | 20,306 | 28 |
| Small dev | 1.05 | 5,076 | 7 |
| Large train | 12.3 | 90,777 | 69 |
| Large dev | 2.39 | 20,246 | 14* |
*Weakly labeled data was not used in the validation set.
best for
- Transcribing Hebrew speech in audio files
- Building Hebrew voice-to-text applications
- Subtitling Hebrew audio content
FAQ
The model accepts raw audio signals (typically 16 kHz mono WAV) as input and outputs transcribed Hebrew text.
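As an illustration of the expected input format, the sketch below reads a WAV file with Python's standard library, checks that it is 16 kHz mono 16-bit PCM, and converts the samples to floats in [-1.0, 1.0). The function name `load_wav_mono_16k` is hypothetical, and the restriction to 16-bit PCM is an assumption made to keep the example short; it is not a requirement stated for this model.

```python
import array
import sys
import wave

EXPECTED_SAMPLE_RATE = 16_000  # Hz, the typical input rate mentioned above


def load_wav_mono_16k(path: str) -> list[float]:
    """Read a 16-bit PCM, 16 kHz mono WAV file as floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError(f"expected mono audio, got {wav.getnchannels()} channels")
        if wav.getframerate() != EXPECTED_SAMPLE_RATE:
            raise ValueError(
                f"expected {EXPECTED_SAMPLE_RATE} Hz, got {wav.getframerate()} Hz"
            )
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        frames = wav.readframes(wav.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return [sample / 32768.0 for sample in samples]
```

Audio recorded at another sample rate or with several channels would need to be resampled and downmixed before it is passed to the model.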
Use the gigarouter OpenAI-compatible endpoint with your API key, sending audio data and specifying the model name.
On the private large dataset, the final WER is 0.2318; on the small dataset, it is 0.1697.
The base model is licensed under Apache-2.0, and the fine-tuned model inherits that license.
The underlying Wav2Vec2 architecture is efficient, but latency depends on hardware and audio length. The model is optimized for batch processing.
We're benchmarking and onboarding Wav2Vec2 XLS-R 300M Hebrew as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.