
W2v-BERT 2.0

facebook/w2v-bert-2.0

published Dec 2023 · updated Jan 2024

W2v-BERT 2.0 is a speech encoder model that extracts high-quality audio embeddings from raw speech for downstream tasks like ASR and audio classification.

est. price: ~$0.008 / 1M tokens (estimated, set at launch)
API providers: 0
downloads / mo: 3.7M
license: MIT

specs

Task: Speech Embedding
Architecture: Conformer-based W2v-BERT 2.0
Parameters: 600M
Pre-training Data: 4.5M hours of unlabeled audio, 143+ languages

about this model

W2v-BERT 2.0 is a speech encoder model that converts raw audio into dense embeddings for downstream tasks such as automatic speech recognition (ASR) and audio classification. Built on a Conformer architecture, it was pre-trained on 4.5M hours of unlabeled audio spanning more than 143 languages, which gives it a multilingual foundation that is particularly useful for low-resource languages. The pre-trained checkpoint ships without a task-specific head: out of the box it produces embeddings only, and it must be fine-tuned before it can produce task outputs such as transcripts. It is not intended for zero-shot inference.

Key capabilities

  • Produces high-quality audio embeddings from the top encoder layer.
  • Part of the Seamless communication family, enabling expressive and streaming speech translation pipelines.
  • After fine-tuning on a low-resource language (Mongolian ASR, ~14h of training data), achieves word error rates comparable to Whisper-large-v3 while being 10x–30x faster and 2.5x more resource-efficient (benchmarked on a 16GB V100 GPU).
  • Whisper-large-v3 produces >100% WER on Mongolian without fine-tuning, highlighting W2v-BERT 2.0’s advantage for low-resource languages.
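
A WER above 100% can look like a mistake, but it follows from the usual definition: substitutions, deletions, and insertions divided by the number of reference words. When a model emits many spurious words, the error count can exceed the reference length. A minimal sketch in plain Python (the function name and sample strings are illustrative and not taken from the benchmark above):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# Two reference words, five unrelated hypothesis words:
# 2 substitutions + 3 insertions = 5 errors over 2 words.
print(word_error_rate("hello world", "a b c d e"))  # 2.5, i.e. 250% WER
```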

Model details

Model | Parameters | Checkpoint
W2v-BERT 2.0 | 600M | Download

Architecture highlights

  • Conformer-based encoder with 600 million parameters.
  • Trained on 4.5M hours of unlabeled audio in 143+ languages.
  • Requires fine-tuning with a task-specific head for ASR, audio classification, or similar tasks.
  • Supported by the Hugging Face Transformers library and the Seamless Communication framework.
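
As a rough illustration of the Transformers route, the sketch below loads the checkpoint and turns one waveform into a single vector. `AutoFeatureExtractor` and `Wav2Vec2BertModel` are Transformers classes. The 16 kHz default sampling rate, the mean pooling over frames, and the helper names `mean_pool` and `embed_audio` are assumptions made for this example, not something this page specifies.

```python
from typing import List, Sequence

MODEL_ID = "facebook/w2v-bert-2.0"


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level vectors into one fixed-size vector (assumed pooling choice)."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def embed_audio(waveform, sampling_rate: int = 16_000) -> List[float]:
    """Return one embedding for a mono waveform given as a 1-D float array."""
    # Heavy dependencies are imported lazily so mean_pool stays usable on its own.
    import torch
    from transformers import AutoFeatureExtractor, Wav2Vec2BertModel

    extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
    model = Wav2Vec2BertModel.from_pretrained(MODEL_ID)
    model.eval()

    inputs = extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state has shape (batch, frames, hidden_size); use the only batch item.
    frames = outputs.last_hidden_state[0].tolist()
    return mean_pool(frames)
```

Mean pooling is only one way to collapse frame-level outputs. For ASR fine-tuning the per-frame outputs would normally be kept and passed to a task head instead.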


FAQ

What languages does W2v-BERT 2.0 support?

It was pre-trained on over 143 languages using 4.5 million hours of unlabeled audio data.

How many parameters does W2v-BERT 2.0 have?

The model has 600M parameters (580M per some sources) and is a Conformer-based encoder.

How does W2v-BERT 2.0 compare to Whisper in speed and accuracy?

After fine-tuning, it achieves similar WER to Whisper-large-v3 while being 10x–30x faster and 2.5x more resource-efficient on a 16GB GPU.

Can I use W2v-BERT 2.0 without fine-tuning?

Yes, you can extract audio embeddings directly from raw audio using the model's top layer. Fine-tuning is required for specific tasks like ASR.

How do I call W2v-BERT 2.0 via the gigarouter API?

The model is not yet live on gigarouter. Once it launches, call the gigarouter OpenAI-compatible embeddings endpoint with your API key and pass raw audio as the input.
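
Since the hosted endpoint has not launched, the request shape is not documented yet. The sketch below only builds (and does not send) a request in the style of an OpenAI-compatible embeddings call. The base URL is a placeholder, and both the base64 encoding of the audio in the `input` field and the use of `facebook/w2v-bert-2.0` as the model id are assumptions, so treat the whole thing as hypothetical until the provider publishes its API reference.

```python
import base64
import json
import urllib.request

# Hypothetical placeholder; the real base URL is not given on this page.
BASE_URL = "https://api.gigarouter.example/v1"


def build_embedding_request(
    audio_bytes: bytes,
    api_key: str,
    model: str = "facebook/w2v-bert-2.0",  # assumed model id
) -> urllib.request.Request:
    """Build a POST request for an OpenAI-style /embeddings endpoint."""
    payload = {
        "model": model,
        # Assumption: raw audio is sent as a base64 string in "input".
        "input": base64.b64encode(audio_bytes).decode("ascii"),
    }
    return urllib.request.Request(
        f"{BASE_URL}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_embedding_request(b"\x00\x01\x02", api_key="YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

To send it, the request object would be passed to `urllib.request.urlopen` once a real endpoint and payload format are known.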

not yet live

We're benchmarking and onboarding W2v-BERT 2.0 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
