wespeaker voxceleb resnet34 LM
pyannote/wespeaker-voxceleb-resnet34-LM
published Nov 2023 · updated May 2024
A popular open specialist model, with 6.8M downloads a month. gigarouter benchmarks and hosts it as an OpenAI-compatible API.
specs
| Task | Speaker Embedding & Verification |
| Architecture | ResNet34 with Large-Margin fine-tuning (LM) |
| Training Data | VoxCeleb (7,000+ speakers, 2,000+ hours) |
| License | Creative Commons Attribution 4.0 International |
about this model
pyannote/wespeaker-voxceleb-resnet34-LM is a speaker embedding model that extracts fixed-dimensional vector representations from speech audio, enabling speaker verification, diarization, and similarity comparison tasks. It wraps the WeSpeaker ResNet34-LM architecture trained on the VoxCeleb dataset, which contains over 7,000 speakers and 1 million+ utterances recorded in diverse acoustic conditions with background noise, overlapping speech, and varying channel effects.
The "LM" suffix indicates the model underwent large-margin fine-tuning, which improves discrimination for longer audio segments (typically greater than 3 seconds). The model produces a single embedding vector per audio file or excerpt, and cosine distance between embeddings quantifies speaker dissimilarity.
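The comparison step can be sketched in a few lines of plain Python. This is a minimal illustration, not part of the model's API: the helper names `cosine_distance` and `same_speaker` and the 0.5 threshold are hypothetical, and a usable threshold would need to be tuned on held-out verification trials. The toy 4-dimensional vectors stand in for real embeddings.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def same_speaker(a: Sequence[float], b: Sequence[float],
                 threshold: float = 0.5) -> bool:
    """Hypothetical decision rule: accept when the distance is below threshold."""
    return cosine_distance(a, b) < threshold


# Toy vectors standing in for real speaker embeddings (assumption).
enroll = [0.9, 0.1, 0.0, 0.4]
test_same = [0.8, 0.2, 0.1, 0.5]
test_other = [-0.3, 0.9, 0.7, -0.1]

print(same_speaker(enroll, test_same))
print(same_speaker(enroll, test_other))
```

A distance near 0 means the two vectors point in almost the same direction, and a distance near 1 or above means they are unrelated or opposed. With a 1 x D array from the model, flatten it to a 1-D sequence before calling these helpers.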
Key capabilities
- Extract whole-file or per-excerpt speaker embeddings (1 x D numpy array)
- Sliding window embedding extraction for temporal analysis
- GPU-accelerated inference via pyannote.audio
- Compatible with speaker verification and diarization pipelines
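The capabilities above might be exercised roughly as follows. This is a sketch assuming a recent pyannote.audio 3.x install and its `Model` and `Inference` classes; the helper names (`load_inference`, `embed_file`, `embed_excerpt`), the audio paths, and the 3.0 s / 1.0 s window settings are illustrative choices, not values taken from this page. Imports are deferred into the functions so the snippet loads even where pyannote.audio is not installed.

```python
def load_inference(window: str = "whole", use_gpu: bool = False):
    """Build a pyannote.audio Inference object for this model.

    window="whole" yields one embedding per file or excerpt;
    window="sliding" yields one embedding per window position.
    """
    import torch
    from pyannote.audio import Inference, Model

    model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")
    if window == "sliding":
        # Window duration and step are in seconds (illustrative values).
        inference = Inference(model, window="sliding", duration=3.0, step=1.0)
    else:
        inference = Inference(model, window="whole")
    if use_gpu and torch.cuda.is_available():
        inference.to(torch.device("cuda"))
    return inference


def embed_file(path: str, use_gpu: bool = False):
    """Return a single embedding for the whole audio file at `path`."""
    return load_inference("whole", use_gpu)(path)


def embed_excerpt(path: str, start: float, end: float):
    """Return a single embedding for the excerpt [start, end] in seconds."""
    from pyannote.core import Segment

    inference = load_inference("whole")
    return inference.crop(path, Segment(start, end))
```

For example, `embed_file("speaker1.wav")` (a placeholder path) would return the whole-file embedding, while `load_inference("sliding")("audio.wav")` would return one embedding per window. Exact return types can vary with the installed pyannote.audio version.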
Benchmark context
The model is part of the WeSpeaker family, which also includes deeper ResNet variants (ResNet152_LM, ResNet221_LM, ResNet293_LM) and other architectures (CAM++, ECAPA512, ECAPA1024) trained on the same VoxCeleb data. Within that family, ResNet34-LM offers a balanced trade-off between embedding quality and computational cost. The underlying WeSpeaker toolkit was published at ICASSP 2023.
License
Licensed under Creative Commons Attribution 4.0 International, consistent with the VoxCeleb dataset terms.
FAQ
The model is best suited to extracting speaker embeddings for speaker verification and diarization tasks.
It is a ResNet34 model fine-tuned with a large-margin loss and trained on VoxCeleb (7,000+ speakers). Performance depends on audio length: it works better on segments longer than 3 seconds.
Input is audio (a WAV file). Output is a D-dimensional speaker embedding: a 1 x D numpy array for a whole file, or N x D for sliding windows.
To call the model, use the gigarouter OpenAI-compatible endpoint with your API key: send audio data and receive embeddings in the response.
The model is released under the Creative Commons Attribution 4.0 International License, following the VoxCeleb dataset license.