W2v-BERT 2.0

facebook/w2v-bert-2.0

published Dec 2023 · updated Jan 2024

W2v-BERT 2.0 is a speech encoder model that extracts high-quality audio embeddings from raw speech for downstream tasks like ASR and audio classification.

est. price	~$0.008 / 1M tokens (estimated, set at launch)

API providers

downloads / mo	3.7M
license	MIT

specs

Task	Speech Embedding
Architecture	Conformer-based W2v-BERT 2.0
Parameters	600M
Pre-training Data	4.5M hours of unlabeled audio, 143+ languages

about this model

W2v-BERT 2.0 is a speech encoder model that converts raw audio into dense embeddings for downstream tasks such as automatic speech recognition and audio classification. Built on a Conformer architecture, it was pre-trained on 4.5M hours of unlabeled audio spanning more than 143 languages, which gives it a multilingual foundation well suited to fine-tuning on low-resource languages. The model is designed to be fine-tuned for specific tasks; it is not intended for zero-shot inference.

Key capabilities

Produces high-quality audio embeddings from the top encoder layer.
Part of the Seamless communication family, enabling expressive and streaming speech translation pipelines.
After fine-tuning on a low-resource language (Mongolian ASR, ~14h of training data), achieves word error rates comparable to Whisper-large-v3 while being 10x–30x faster and 2.5x more resource-efficient (benchmarked on a 16GB V100 GPU).
Whisper-large-v3 produces >100% WER on Mongolian without fine-tuning, highlighting W2v-BERT 2.0’s advantage for low-resource languages.
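
A WER above 100% can look like a typo, so here is a minimal sketch of how the metric is usually computed: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. Because insertions count as errors, a hypothesis that is longer than the reference and shares no words with it scores above 1.0. The function names `edit_distance` and `wer` are illustrative and not taken from the benchmark behind the numbers above; the placeholder strings are not Mongolian data.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    # prev[j] holds the distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Three reference words, five unrelated hypothesis words:
# 3 substitutions + 2 insertions = 5 errors over 3 words.
score = wer("a b c", "x y z w v")
```

Here `score` is 5/3, roughly 167%, which is the kind of result an unadapted model can produce when it emits long, unrelated output for a language it handles poorly.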

Model details

Model	Parameters	Checkpoint
W2v-BERT 2.0	600M	Download

Architecture highlights

Conformer-based encoder with 600 million parameters.
Trained on 4.5M hours of unlabeled audio in 143+ languages.
Requires fine-tuning with a task-specific head for ASR, audio classification, or similar tasks.
Supported by the Hugging Face Transformers library and the Seamless Communication framework.
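
As a rough illustration of the Transformers route, the sketch below loads the encoder with `Wav2Vec2BertModel` and `AutoFeatureExtractor`, runs one waveform through it, and averages the top-layer frames into a single utterance vector. Several things here are assumptions rather than details from this page: the input is taken to be mono audio at 16 kHz, mean pooling is just one simple way to get a fixed-size vector, and the helper names `mean_pool` and `embed_utterance` are made up for the example.

```python
from typing import List, Sequence

MODEL_ID = "facebook/w2v-bert-2.0"


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level vectors into one utterance-level vector."""
    if not frames:
        raise ValueError("need at least one frame to pool")
    dim = len(frames[0])
    sums = [0.0] * dim
    for frame in frames:
        for i, value in enumerate(frame):
            sums[i] += value
    return [total / len(frames) for total in sums]


def embed_utterance(waveform: Sequence[float], sampling_rate: int = 16_000) -> List[float]:
    """Run the encoder on one mono waveform and mean-pool its top layer.

    Heavy imports are deferred so that mean_pool can be used on its own
    without PyTorch or Transformers installed.
    """
    import torch
    from transformers import AutoFeatureExtractor, Wav2Vec2BertModel

    feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
    model = Wav2Vec2BertModel.from_pretrained(MODEL_ID).eval()

    # Assumption: the waveform is already mono and sampled at `sampling_rate`.
    inputs = feature_extractor(
        list(waveform), sampling_rate=sampling_rate, return_tensors="pt"
    )
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state has shape (batch, frames, hidden_size); one item here.
    frames = outputs.last_hidden_state[0].tolist()
    return mean_pool(frames)
```

Calling `embed_utterance(samples)` on a list of float samples downloads the checkpoint on first use. For ASR or classification, the pooled vector or the frame sequence would feed a task-specific head trained during fine-tuning.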

best for

· Fine-tuning for automatic speech recognition (ASR) on low-resource languages
· Extracting audio embeddings for speaker verification or audio classification
· Building speech translation systems (e.g., Seamless)
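
For the speaker-verification use case, one common way to compare two utterance embeddings is cosine similarity against a decision threshold. The sketch below assumes the embeddings are already available as plain lists of floats; the function names and the 0.7 threshold are hypothetical, and a real system would tune the threshold on held-out data and would likely fine-tune the encoder first.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(
    enrolled: Sequence[float],
    probe: Sequence[float],
    threshold: float = 0.7,  # hypothetical value; tune on held-out trials
) -> bool:
    """Accept the probe if it is close enough to the enrolled embedding."""
    return cosine_similarity(enrolled, probe) >= threshold
```

Cosine similarity ignores vector length and compares direction only, which is why it is a frequent default for embedding comparison.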

FAQ

What languages does W2v-BERT 2.0 support?

It was pre-trained on over 143 languages using 4.5 million hours of unlabeled audio data.

How many parameters does W2v-BERT 2.0 have?

The model has 600M parameters (580M per some sources) and is a Conformer-based encoder.

How does W2v-BERT 2.0 compare to Whisper in speed and accuracy?

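After fine-tuning on a low-resource language (Mongolian ASR, ~14h of training data), it achieves word error rates comparable to Whisper-large-v3 while being 10x–30x faster and 2.5x more resource-efficient, as benchmarked on a 16GB V100 GPU.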

Can I use W2v-BERT 2.0 without fine-tuning?

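Yes, for embedding extraction: the pre-trained encoder produces audio embeddings from its top layer directly from raw audio. For task outputs such as ASR transcripts, the model must first be fine-tuned with a task-specific head; it is not intended for zero-shot inference.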

How do I call W2v-BERT 2.0 via the gigarouter API?

W2v-BERT 2.0 is not yet live on gigarouter. Once it is hosted, use the gigarouter OpenAI-compatible endpoint with your API key and pass raw audio as input to the embeddings endpoint.

not yet live

We're benchmarking and onboarding W2v-BERT 2.0 as a hosted, OpenAI-compatible API.
