W2v-BERT 2.0
facebook/w2v-bert-2.0
published Dec 2023 · updated Jan 2024
W2v-BERT 2.0 is a speech encoder model that extracts high-quality audio embeddings from raw speech for downstream tasks like ASR and audio classification.
specs
| Spec | Value |
|---|---|
| Task | Speech Embedding |
| Architecture | Conformer-based W2v-BERT 2.0 |
| Parameters | 600M |
| Pre-training Data | 4.5M hours of unlabeled audio, 143+ languages |
about this model
W2v-BERT 2.0 is a speech encoder model that converts raw audio into dense embeddings for downstream tasks such as automatic speech recognition and audio classification. Built on a Conformer architecture, it was pre-trained on 4.5M hours of unlabeled audio spanning over 143 languages, providing a multilingual foundation that excels on low-resource languages. The model is designed to be fine-tuned for specific tasks; it is not intended for zero-shot inference.
Key capabilities
- Produces high-quality audio embeddings from the top encoder layer.
- Part of the Seamless communication family, enabling expressive and streaming speech translation pipelines.
- After fine-tuning on a low-resource language (Mongolian ASR, ~14h of training data), achieves word error rates comparable to Whisper-large-v3 while being 10x–30x faster and 2.5x more resource-efficient (benchmarked on a 16GB V100 GPU).
- Whisper-large-v3 produces >100% WER on Mongolian without fine-tuning, highlighting W2v-BERT 2.0’s advantage for low-resource languages.
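A WER above 100% can look like a typo, so here is a minimal sketch of the usual word-level definition, WER = (substitutions + deletions + insertions) / reference words. Because insertions count as errors, a hypothesis that is longer than the reference and mostly wrong pushes the ratio past 1.0. This is an illustration only, not the evaluation script behind the figures above; the function name and the placeholder strings are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder strings (not real model output): 3 reference words,
# 5 unrelated hypothesis words -> 3 substitutions + 2 insertions.
print(word_error_rate("a b c", "v w x y z"))
```

With three reference words and five unrelated hypothesis words, the edit distance is 5, so the ratio is 5/3, which is well above 100%.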
Model details
| Model | Parameters | Checkpoint |
|---|---|---|
| W2v-BERT 2.0 | 600M | Download |
Architecture highlights
- Conformer-based encoder with 600 million parameters.
- Trained on 4.5M hours of unlabeled audio in 143+ languages.
- Requires fine-tuning with a task-specific head for ASR, audio classification, or similar tasks.
- Supported by the Hugging Face Transformers library and the Seamless Communication framework.
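The points above can be sketched with the Hugging Face Transformers classes for this architecture (`AutoFeatureExtractor`, `Wav2Vec2BertModel`, `Wav2Vec2BertForCTC`). Treat this as a hedged sketch rather than a reference recipe: the helper names, the mean pooling, the 16 kHz mono input, and the CTC settings are assumptions made for illustration, and a reasonably recent Transformers release plus PyTorch are assumed to be installed.

```python
MODEL_ID = "facebook/w2v-bert-2.0"  # repository id shown at the top of this page


def load_encoder():
    """Load the feature extractor and the bare pre-trained encoder."""
    # Imported lazily so the heavy dependencies are only needed when used.
    from transformers import AutoFeatureExtractor, Wav2Vec2BertModel

    extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
    encoder = Wav2Vec2BertModel.from_pretrained(MODEL_ID).eval()
    return extractor, encoder


def embed(waveform, extractor, encoder, sampling_rate=16_000):
    """Return top-layer frame embeddings and their mean over time.

    `waveform` is assumed to be a 1-D float array of mono audio.
    """
    import torch

    inputs = extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        frames = encoder(**inputs).last_hidden_state  # (1, num_frames, hidden_size)
    # Mean pooling is one simple way to get a single vector per utterance.
    return frames, frames.mean(dim=1)


def load_ctc_model(vocab_size, pad_token_id):
    """Attach a freshly initialised CTC head for ASR fine-tuning."""
    from transformers import Wav2Vec2BertForCTC

    return Wav2Vec2BertForCTC.from_pretrained(
        MODEL_ID,
        vocab_size=vocab_size,        # size of your own character or subword vocabulary
        pad_token_id=pad_token_id,    # usually the tokenizer's padding id
        ctc_loss_reduction="mean",
    )
```

A typical call would be `extractor, encoder = load_encoder()` followed by `frames, pooled = embed(waveform, extractor, encoder)`. The CTC head returned by `load_ctc_model` starts untrained, so it only becomes useful after fine-tuning on labeled speech.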
best for
- Fine-tuning for automatic speech recognition (ASR) on low-resource languages
- Extracting audio embeddings for speaker verification or audio classification
- Building speech translation systems (e.g., Seamless)
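For the speaker-verification use case, embeddings are commonly compared with cosine similarity. The sketch below assumes you already have one fixed-length vector per utterance; the function names and the 0.7 threshold are hypothetical, and in practice the threshold would be tuned on held-out data, most likely after fine-tuning the encoder for the task.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(a, b, threshold=0.7):
    """Accept the pair when similarity reaches the (hypothetical) threshold."""
    return cosine_similarity(a, b) >= threshold


# Toy 2-D vectors standing in for real utterance embeddings.
print(same_speaker([1.0, 0.0], [1.0, 0.1]))
print(same_speaker([1.0, 0.0], [0.0, 1.0]))
```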
FAQ
W2v-BERT 2.0 was pre-trained on more than 143 languages using 4.5 million hours of unlabeled audio data.
The model has 600M parameters (580M per some sources) and is a Conformer-based encoder.
After fine-tuning on Mongolian ASR (~14h of training data), W2v-BERT 2.0 achieves a WER similar to Whisper-large-v3 while being 10x–30x faster and 2.5x more resource-efficient on a 16GB V100 GPU.
Audio embeddings can be extracted directly from raw audio using the model's top layer, but fine-tuning is required for specific tasks like ASR.
Use the gigarouter OpenAI-compatible endpoint with your API key and pass raw audio as input to the embeddings endpoint.
W2v-BERT 2.0 is currently being benchmarked and onboarded as a hosted, OpenAI-compatible API.