wespeaker voxceleb resnet34 LM

pyannote/wespeaker-voxceleb-resnet34-LM

published Nov 2023 · updated May 2024

A popular open specialist model, with 6.8M downloads a month. gigarouter is benchmarking it and plans to host it as an OpenAI-compatible API.

status

coming soon

API providers

downloads / mo

6.8M

license

cc-by-4.0

specs

Task	Speaker Embedding & Verification
Architecture	ResNet34 with Large-Margin fine-tuning (LM)
Training Data	VoxCeleb (7,000+ speakers, 2,000+ hours)
License	Creative Commons Attribution 4.0 International

about this model

pyannote/wespeaker-voxceleb-resnet34-LM is a speaker embedding model that extracts fixed-dimensional vector representations from speech audio, enabling speaker verification, diarization, and similarity comparison tasks. It wraps the WeSpeaker ResNet34-LM architecture trained on the VoxCeleb dataset, which contains over 7,000 speakers and 1 million+ utterances recorded in diverse acoustic conditions with background noise, overlapping speech, and varying channel effects.

The "LM" suffix indicates the model underwent large-margin fine-tuning, which improves discrimination for longer audio segments (typically greater than 3 seconds). The model produces a single embedding vector per audio file or excerpt, and cosine distance between embeddings quantifies speaker dissimilarity.
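As a rough illustration of the cosine-distance comparison described above, here is a minimal sketch in plain Python. The toy 4-dimensional vectors stand in for real embeddings, and the `same_speaker` helper and its `threshold=0.5` default are hypothetical: a real threshold would need to be tuned on held-out verification trials.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def same_speaker(a, b, threshold=0.5):
    """Hypothetical decision rule: accept when the distance is below threshold.

    The default threshold is a placeholder, not a recommended value.
    """
    return cosine_distance(a, b) < threshold


# Toy vectors standing in for real D-dimensional embeddings.
enrolled = [0.9, 0.1, 0.3, 0.2]
probe_close = [0.8, 0.2, 0.35, 0.15]
probe_far = [-0.2, 0.9, -0.4, 0.1]

print(same_speaker(enrolled, probe_close))
print(same_speaker(enrolled, probe_far))
```

Smaller distances mean the two embeddings point in more similar directions, so a lower threshold makes the check stricter.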

Key capabilities

Extract whole-file or per-excerpt speaker embeddings (1 x D numpy array)
Sliding window embedding extraction for temporal analysis
GPU-accelerated inference via pyannote.audio
Compatible with speaker verification and diarization pipelines
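To make the sliding-window capability more concrete, the sketch below computes which time spans would each receive one embedding, which is where the N in an N x D output comes from. It is a simplified illustration, not the pyannote.audio implementation: the `sliding_windows` function is hypothetical, the 3-second window and 1-second step are example values, and a trailing partial window is simply dropped here, which may differ from what the library does.

```python
def sliding_windows(total_duration, duration=3.0, step=1.0):
    """Return (start, end) pairs in seconds for every full window.

    Simplified sketch: a trailing window that would run past the end
    of the audio is dropped.
    """
    if duration <= 0 or step <= 0:
        raise ValueError("duration and step must be positive")
    windows = []
    index = 0
    # Small tolerance so float rounding does not drop the last full window.
    while index * step + duration <= total_duration + 1e-9:
        start = index * step
        windows.append((start, start + duration))
        index += 1
    return windows


# A 10-second file with a 3-second window and a 1-second step.
spans = sliding_windows(10.0, duration=3.0, step=1.0)
print(len(spans), "windows, so the embedding matrix would have that many rows")
print(spans[0], spans[-1])
```

A shorter step gives finer temporal resolution at the cost of more embeddings to compute, while windows shorter than about 3 seconds work against the segment lengths the large-margin fine-tuning favors.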

Benchmark context

The model is part of the WeSpeaker family, which also includes ResNet152_LM, ResNet221_LM, ResNet293_LM, CAM++, ECAPA512, and ECAPA1024 models trained on the same VoxCeleb data. ResNet34-LM balances embedding quality against computational cost. The underlying WeSpeaker framework was published at ICASSP 2023.

License

Licensed under Creative Commons Attribution 4.0 International, consistent with the VoxCeleb dataset terms.

FAQ

What is the model best for?

It is best for extracting speaker embeddings used in speaker verification and diarization tasks.

How does it compare to other speaker models?

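It is a ResNet34 model with large-margin fine-tuning, trained on VoxCeleb (7,000+ speakers). Within the WeSpeaker family it trades some embedding quality for lower computational cost than the larger ResNet variants. Performance depends on audio length, and the model works better on segments longer than about 3 seconds.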

What are the input and output formats?

Input is audio (a WAV file). Output is a D-dimensional speaker embedding: a 1 x D numpy array for a whole file, or an N x D array for sliding windows.

How can I call this model via the gigarouter API?

The model is not yet live on gigarouter. Once it is, use the gigarouter OpenAI-compatible endpoint with your API key: send audio data and receive embeddings in the response.

What license does the model use?

It is licensed under Creative Commons Attribution 4.0 International, in line with the VoxCeleb dataset license.

not yet live

We're benchmarking and onboarding wespeaker voxceleb resnet34 LM as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
