WavLM Base Plus

microsoft/wavlm-base-plus

published Mar 2022 · updated Dec 2021

WavLM Base Plus is a self-supervised speech pre-trained model built on the HuBERT framework, designed for full-stack speech processing tasks including recognition, speaker verification, and audio classification.

status

coming soon

API providers

downloads / mo

771.5K

specs

Task	Speech Representation Learning (Self-Supervised Pre-training)
Architecture	HuBERT with gated relative position bias
Training Data	94k hours: Libri-Light (60k), GigaSpeech (10k), VoxPopuli (24k)
Input	16kHz mono waveform
License	MIT

about this model

Microsoft WavLM-Base-Plus is a self-supervised speech embedding model that generates general-purpose representations from raw audio, optimized for 16 kHz sampled speech. Built on the HuBERT framework, it incorporates gated relative position bias in the Transformer layers to improve sequence modeling and an utterance mixing training strategy that overlays unsupervised speech segments to enhance speaker discrimination.

The model was pre-trained on 94,000 hours of English audio drawn from three large-scale corpora: Libri-Light (60,000 hours of audiobooks), GigaSpeech (10,000 hours of multi-domain transcribed speech), and VoxPopuli (24,000 hours of unlabeled speech from European Parliament proceedings). This diverse training setup enables the model to learn robust, multi-faceted representations that capture spoken content, speaker identity, and paralinguistic cues.

The WavLM architecture achieved state-of-the-art results on the SUPERB benchmark (with the Large variant), and the base-plus variant is well-suited for tasks such as speaker recognition, audio classification, and speech separation when fine-tuned or used as a feature extractor. As a hosted API through gigarouter, the model accepts 16 kHz audio input and returns high-dimensional embeddings without requiring local installation or GPU resources.

Schematic overview of the WavLM pre-training and fine-tuning pipeline.

Refer to the WavLM paper and the official repository for further architectural details and evaluation protocols.

best for

·Fine-tuning for automatic speech recognition (ASR)
·Fine-tuning for speaker verification and diarization
·Fine-tuning for audio classification tasks

FAQ

What speech tasks can WavLM Base Plus be fine-tuned for?

It supports ASR, speaker verification, speaker diarization, and audio classification, as shown in the SUPERB benchmark.

How does WavLM Base Plus differ from Wav2Vec2 or HuBERT?

It adds gated relative position bias for better content modeling and an utterance mixing training strategy for improved speaker discrimination.

Is WavLM Base Plus available as a hosted API?

Yes, on gigarouter via an OpenAI-compatible endpoint using an API key.

What is the license for WavLM Base Plus?

The model is released under the MIT License (see UniSpeech license).

not yet live

We're benchmarking and onboarding WavLM Base Plus as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related embeddings models

compare all →

nomic-embed-text-v1.5