Romanian Wav2Vec2

gigant/romanian-wav2vec2

published Mar 2022 · updated Sep 2023

Romanian Wav2Vec2 is an automatic speech recognition (ASR) model for Romanian, fine-tuned from Wav2Vec2-XLS-R-300M and paired with a 5-gram language model.

est. price

~$0.0034

· estimated, set at launch

downloads / mo

2.8M

license

apache-2.0

specs

Task	Automatic Speech Recognition (ASR)
Architecture	Wav2Vec2-XLS-R-300M with CTC head and 5-gram language model (pyctcdecode + kenlm)
Parameters	~300 million
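
To illustrate what a CTC head produces, here is a minimal sketch of greedy CTC decoding: take the best token per audio frame, collapse consecutive repeats, and drop blanks. This is the simple greedy variant, not the pyctcdecode beam search with the 5-gram language model that this model uses. The toy vocabulary and the blank id of 0 are assumptions for illustration and do not match the model's real vocabulary.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions and remove CTC blanks."""
    chars = []
    previous = None
    for token_id in frame_ids:
        # Emit a character only when the prediction changes and is not blank.
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    return "".join(chars)


# Hypothetical toy vocabulary (id 0 is the blank token).
TOY_VOCAB = {0: "", 1: "d", 2: "a", 3: " "}

# Frames: d d _ a a _ a  ->  a blank between the two "a" runs keeps both.
decoded = ctc_greedy_decode([1, 1, 0, 2, 2, 0, 2], TOY_VOCAB)
```

The blank between the two runs of "a" is what lets CTC emit a doubled letter; without it the repeats would collapse into one.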

about this model

gigant/romanian-wav2vec2 is an automatic speech recognition (ASR) model fine-tuned from facebook/wav2vec2-xls-r-300m on the Common Voice 8.0 Romanian subset and additional data from the Romanian Speech Synthesis dataset. The architecture uses a CTC head with a 5-gram language model (built with pyctcdecode and kenlm) trained on the Romanian Corpora Parliament dataset. Audio input must be sampled at 16 kHz; output text is lowercased without punctuation.
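
Because the output is lowercased and unpunctuated, reference text usually needs the same treatment before it is compared with a transcript. The helper below is a rough sketch of that step; `normalize_transcript` is a hypothetical name, and replacing every punctuation mark (including hyphens) with a space is an assumption that may not match the exact preprocessing used for this model.

```python
import unicodedata


def normalize_transcript(text):
    """Lowercase text and replace punctuation with spaces, keeping diacritics."""
    lowered = text.lower()
    cleaned = "".join(
        # Unicode categories starting with "P" cover punctuation marks.
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in lowered
    )
    # Collapse the extra whitespace left behind by removed punctuation.
    return " ".join(cleaned.split())


example = normalize_transcript("Bună ziua, România!")
```

Romanian diacritics such as ă, ș, and ț are letters, not punctuation, so they pass through unchanged.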

The model ranked first for Romanian speech recognition in Hugging Face's Robust Speech Challenge (Speech Bench, Leaderboard). Evaluated on the Common Voice 8.0 Romanian test set without the 5-gram language model, it reports:

Loss: 0.1553
Word error rate (WER): 0.1174
Character error rate (CER): 0.0294
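
For readers unfamiliar with these metrics, WER and CER are edit distances divided by the reference length, counted over words and characters respectively. The sketch below shows that calculation in plain Python. It is an illustration of the definitions, not the evaluation script used for the figures above, and details such as whether spaces count as characters are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```

A WER of 0.1174 therefore means roughly 12 word-level edits per 100 reference words.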

Training hyperparameters: learning rate 0.003, batch size 48 (gradient accumulation 3), Adam optimizer, linear scheduler with 500 warmup steps, 50 epochs, mixed precision (AMP).
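
As a rough sketch of what a linear scheduler with 500 warmup steps does to the learning rate, the function below ramps up linearly to the peak of 0.003 and then decays linearly to zero. The total step count is not given on this page, so `total_steps` is a hypothetical parameter, and the exact schedule in the original training run may differ in detail.

```python
def linear_lr_with_warmup(step, total_steps, base_lr=0.003, warmup_steps=500):
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        # Warmup phase: scale from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay phase: scale from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run length, chosen only to make the example concrete.
TOTAL_STEPS = 10_000
peak = linear_lr_with_warmup(500, TOTAL_STEPS)
```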

You can test the model online via the Romanian Speech Recognition Space.

best for

· Transcribing Romanian audio recordings
· Building Romanian voice assistants
· Subtitling Romanian media content
· Automating Romanian call center transcription

FAQ

What is this model best for?

It is best for Romanian speech recognition, achieving top-1 performance on the Hugging Face Robust Speech Challenge. It outputs lowercase text without punctuation.

What input format does the model require?

Audio clips sampled at 16 kHz. The model predicts text directly from the audio waveform.

How do I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key. Send audio bytes or a URL to the /v1/audio/transcriptions endpoint.
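
The sketch below shows one way such a request could be assembled with the Python standard library. It builds the request without sending it. The base URL is a placeholder, and the multipart field names `model` and `file` follow the usual OpenAI transcription convention; both are assumptions here, since the hosted endpoint is described on this page as not yet live.

```python
import urllib.request
import uuid

# Assumption: placeholder base URL; replace with the provider's real one.
BASE_URL = "https://api.example.com"
MODEL_ID = "gigant/romanian-wav2vec2"


def build_transcription_request(audio_bytes, filename, api_key, base_url=BASE_URL):
    """Build (but do not send) a multipart POST to /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{MODEL_ID}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=head + audio_bytes + tail,  # raw audio bytes sit between the parts
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; an OpenAI-compatible client library pointed at the same base URL should be an equivalent option.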

What are the model's size and speed?

The base model has ~300 million parameters. Speed depends on hardware; it is suitable for both real-time and batch processing.

Does the model include a language model?

Yes, it includes a 5-gram language model trained on Romanian parliamentary data, which is intended to improve accuracy. Note that the reported WER of 0.1174 on the Common Voice 8.0 Romanian test set was measured without the language model.

not yet live

We're benchmarking and onboarding Romanian Wav2Vec2 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related speech-to-text models

speaker-diarization-3.1

wav2vec2-large-xlsr-53-japanese

6.1M dl/mo

wav2vec2-large-xlsr-53-polish

4.7M dl/mo

wav2vec2-large-xlsr-53-dutch

4.1M dl/mo