Wav2Vec2 XLS-R 300M Hebrew

imvladikon/wav2vec2-xls-r-300m-hebrew

published Mar 2022 · updated Sep 2023

Wav2Vec2 XLS-R 300M Hebrew is an automatic speech recognition model fine-tuned from Facebook's XLS-R 300M for transcribing Hebrew speech.

est. price

~$0.0034

· estimated, set at launch

downloads / mo

1.8M

specs

Task	Automatic Speech Recognition
Architecture	Wav2Vec2-XLS-R
Parameters	300M
License	Apache-2.0

about this model

imvladikon/wav2vec2-xls-r-300m-hebrew is an automatic speech recognition (ASR) model fine-tuned from facebook/wav2vec2-xls-r-300m, a 300-million-parameter multilingual model pretrained on 436k hours of unlabeled speech in 128 languages (including Hebrew) drawn from VoxPopuli, MLS, CommonVoice, BABEL, and VoxLingua107. At release, the XLS-R models were reported to set the state of the art on VoxLingua107 language identification and to lower error rates on ASR benchmarks by 14-34% relative to prior work.

This Hebrew ASR model was fine-tuned in two stages: first on a small curated dataset (28 hours, 20,306 samples), then on a larger mixed dataset (69 hours, 90,777 samples) that included weakly labeled data from a previously trained model. After the second training stage, the model achieved:

WER 0.1697 on the small dataset
WER 0.2318 and loss 0.4502 on the large dataset
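For readers unfamiliar with the metric, WER (word error rate) is conventionally the word-level edit distance between the reference transcript and the model output, divided by the number of reference words, so 0.1697 corresponds to roughly 17 word errors per 100 reference words. The sketch below is a minimal illustration of that standard definition; the source does not say which tool or text normalization produced the reported figures, and the function name and example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word)  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only (not taken from the evaluation data).
reference = "שלום לכם וברוכים הבאים"
hypothesis = "שלום לכם ברוכים הבאים"
print(word_error_rate(reference, hypothesis))  # one substitution over four words: 0.25
```

Because the score depends on how text is tokenized and normalized (punctuation, diacritics), figures computed this way are only comparable when both sides use the same preprocessing.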

Training ran on two GPUs with gradient accumulation (total batch size 64), a learning rate of 0.0003, a linear learning-rate scheduler with 1,000 warmup steps, and native AMP mixed precision. The base model is released under the Apache-2.0 license.
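The hyperparameters above can be sketched as simple arithmetic. The block below shows how a total batch size of 64 can arise from two devices plus gradient accumulation, and what a linear schedule with 1,000 warmup steps looks like. The per-device batch size and the total number of training steps are assumptions chosen for illustration, since the text above gives neither.

```python
NUM_DEVICES = 2          # from the text above
TOTAL_BATCH_SIZE = 64    # from the text above
LEARNING_RATE = 3e-4     # from the text above
WARMUP_STEPS = 1000      # from the text above

# Assumption: a per-device batch size of 8 (not stated in the text above).
PER_DEVICE_BATCH_SIZE = 8
GRAD_ACCUM_STEPS = TOTAL_BATCH_SIZE // (NUM_DEVICES * PER_DEVICE_BATCH_SIZE)


def linear_lr(step: int, total_steps: int,
              peak: float = LEARNING_RATE, warmup: int = WARMUP_STEPS) -> float:
    """Linear warmup from 0 to `peak`, then linear decay to 0 at `total_steps`."""
    if step < warmup:
        return peak * step / warmup
    remaining = max(0, total_steps - step)
    return peak * remaining / max(1, total_steps - warmup)


# Assumption: 10,000 optimizer steps in total (hypothetical value).
for step in (0, 500, 1000, 5500, 10_000):
    print(step, linear_lr(step, total_steps=10_000))
```

With these assumed values, each optimizer step accumulates gradients over 4 forward/backward passes per device before updating the weights.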

Training data

Dataset	Size (GB)	Samples	Duration (hrs)
Small train	4.19	20,306	28
Small dev	1.05	5,076	7
Large train	12.3	90,777	69
Large dev	2.39	20,246	14*

*Weakly labeled data was not used in the validation set.
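As a rough reading of the table, dividing each duration by its sample count gives an approximate mean clip length. The sketch below does that arithmetic; since the durations are rounded to whole hours, the results are estimates only, and the helper name is made up for illustration.

```python
# (samples, duration in hours), copied from the table above
DATASETS = {
    "Small train": (20_306, 28),
    "Small dev": (5_076, 7),
    "Large train": (90_777, 69),
    "Large dev": (20_246, 14),
}


def mean_clip_seconds(samples: int, hours: float) -> float:
    """Approximate mean clip length in seconds."""
    return hours * 3600 / samples


for name, (samples, hours) in DATASETS.items():
    print(f"{name}: ~{mean_clip_seconds(samples, hours):.1f} s per sample")
```

On these numbers, clips in the small dataset average about 5 seconds, while clips in the large dataset are noticeably shorter on average.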

best for

· Transcribing Hebrew speech in audio files
· Building Hebrew voice-to-text applications
· Subtitling Hebrew audio content

FAQ

What is the input format for this model?

The model takes a raw audio waveform as input, sampled at 16 kHz and mono (for example, decoded from a WAV file), and outputs the transcribed Hebrew text.
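A minimal local-inference sketch, assuming the checkpoint loads with the standard Hugging Face `transformers` CTC classes and that `torch`, `transformers`, and `librosa` are installed. The function name and file path are hypothetical, and this is one plausible way to run the model rather than the author's documented procedure.

```python
MODEL_ID = "imvladikon/wav2vec2-xls-r-300m-hebrew"
TARGET_SAMPLE_RATE = 16_000  # Wav2Vec2 models expect 16 kHz input


def transcribe(path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one audio file with greedy CTC decoding."""
    # Heavy dependencies are imported lazily so that defining this function
    # does not require them to be installed.
    import librosa
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id)
    model.eval()

    # librosa resamples to 16 kHz and downmixes to mono in one call.
    waveform, _ = librosa.load(path, sr=TARGET_SAMPLE_RATE, mono=True)

    inputs = processor(waveform, sampling_rate=TARGET_SAMPLE_RATE,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits

    # Greedy decoding: pick the most likely token at every frame.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]


# Usage (hypothetical file name):
# print(transcribe("sample_hebrew.wav"))
```

Loading the model on every call is wasteful; in an application, the processor and model would typically be created once and reused.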

How do I call this model via the API?

Once the model is live, call the gigarouter OpenAI-compatible endpoint with your API key, sending the audio data and specifying the model name imvladikon/wav2vec2-xls-r-300m-hebrew.

What WER does this model achieve?

After the second training stage, the WER is 0.1697 on the small dataset and 0.2318 on the private large dataset, that is, about 17% and 23% of words in error.

What is the license of this model?

The model is listed under Apache-2.0, the same license as the base model.

Is this model suitable for real-time transcription?

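With caveats. Wav2Vec2 is not a streaming architecture: it processes a whole audio segment at once, so near-real-time use means transcribing short chunks as they arrive. Latency depends on hardware and chunk length, and the model is better suited to batch (offline) processing.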
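One common workaround for non-streaming models is to cut the incoming audio into short, slightly overlapping windows and transcribe each window separately. The sketch below only computes the window boundaries; the chunk and overlap lengths are arbitrary illustrative values, the function name is made up, and stitching the overlapping transcripts back together is not shown.

```python
def split_into_chunks(num_samples: int, sample_rate: int = 16_000,
                      chunk_seconds: float = 10.0,
                      overlap_seconds: float = 1.0) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of overlapping windows."""
    chunk = int(chunk_seconds * sample_rate)
    step = chunk - int(overlap_seconds * sample_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk length must be positive and exceed the overlap")

    spans = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        spans.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the audio
        start += step
    return spans


# 25 seconds of 16 kHz audio, 10 s windows with 1 s overlap
for start, end in split_into_chunks(25 * 16_000):
    print(start / 16_000, end / 16_000)
```

Shorter windows lower latency but give the model less context per call, so the right trade-off depends on the application.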

not yet live

We're benchmarking and onboarding Wav2Vec2 XLS-R 300M Hebrew as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
