Speaker Diarization 3.1
pyannote/speaker-diarization-3.1
published Nov 2023 · updated May 2024
Speaker Diarization 3.1 is a speaker diarization pipeline that identifies "who spoke when" in an audio recording using pure PyTorch for segmentation and embedding.
specs
| Task | Speaker Diarization |
| Architecture | PyTorch pipeline with speaker segmentation and embedding models |
| License | MIT |
about this model
pyannote/speaker-diarization-3.1 is a speaker diarization pipeline that identifies "who spoke when" in mono audio sampled at 16 kHz. It combines a neural speaker segmentation model and a speaker embedding model, both running in pure PyTorch with no ONNX Runtime dependency, to produce an annotation of speaker turns.
Key characteristics
- Fully automatic diarization: no manual voice activity detection and no pre-specified number of speakers (optional num_speakers, min_speakers, and max_speakers parameters accepted).
- Automatically downmixes stereo and multi-channel audio to mono and resamples to 16 kHz on load.
- Pure PyTorch execution simplifies deployment and may improve inference speed compared to version 3.0, which relied on ONNX Runtime.
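The optional speaker-count parameters listed above can be sketched as follows. This is a minimal illustration, not the project's reference code: the helper names speaker_hints and diarize are hypothetical, hf_token stands for a Hugging Face access token that the model is assumed to require, and the use_auth_token keyword is assumed to match pyannote.audio 3.1 (other releases may name it differently).

```python
from typing import Optional


def speaker_hints(num_speakers: Optional[int] = None,
                  min_speakers: Optional[int] = None,
                  max_speakers: Optional[int] = None) -> dict:
    """Collect only the speaker-count hints that were actually provided."""
    hints = {
        "num_speakers": num_speakers,
        "min_speakers": min_speakers,
        "max_speakers": max_speakers,
    }
    hints = {name: value for name, value in hints.items() if value is not None}
    for name, value in hints.items():
        if value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value}")
    if "min_speakers" in hints and "max_speakers" in hints:
        if hints["min_speakers"] > hints["max_speakers"]:
            raise ValueError("min_speakers cannot exceed max_speakers")
    return hints


def diarize(audio_path: str, hf_token: str, **hints):
    """Hypothetical wrapper: load the pipeline and run it on one file."""
    # Imported lazily so speaker_hints() works without pyannote.audio installed.
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=hf_token,  # assumption: keyword as in pyannote.audio 3.1
    )
    # With no hints, the pipeline estimates the number of speakers itself.
    return pipeline(audio_path, **speaker_hints(**hints))
```

Calling diarize("call.wav", token) leaves the speaker count to the pipeline, while diarize("call.wav", token, min_speakers=2, max_speakers=5) bounds it. The file name here is a placeholder.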
Benchmark results
Evaluated under the most demanding diarization error rate (DER) configuration (no forgiveness collar, overlapped speech included) across nine public datasets:
| Dataset | DER (%) |
|---|---|
| AISHELL-4 | 12.2 |
| AliMeeting (channel 1) | 24.4 |
| AMI (headset mix, only_words) | 18.8 |
| AMI (array1, channel 1, only_words) | 22.4 |
| AVA-AVD | 50.0 |
| DIHARD 3 (Full) | 21.7 |
| MSDWild | 25.3 |
| REPERE (phase 2) | 7.8 |
| VoxConverse (v0.3) | 11.3 |
Additional benchmarks (CALLHOME part 2, Ego4D dev., RAMC) are reported in the pyannote.audio repository. The pipeline is released under the MIT license.
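For readers unfamiliar with the metric, DER is conventionally the sum of false alarm, missed detection, and speaker confusion durations divided by the total duration of reference speech. The sketch below shows only that arithmetic; the function name and the example durations are illustrative assumptions and are not taken from the benchmark above.

```python
def diarization_error_rate(false_alarm: float,
                           missed_detection: float,
                           confusion: float,
                           total_reference: float) -> float:
    """Return DER as a percentage from error durations given in seconds."""
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    errors = false_alarm + missed_detection + confusion
    return 100.0 * errors / total_reference


# Illustrative numbers only: 4 s of errors over 40 s of reference speech.
example_der = diarization_error_rate(
    false_alarm=1.0, missed_detection=2.0, confusion=1.0, total_reference=40.0
)
```

With no forgiveness collar, errors near speaker-turn boundaries count in full, and with overlapped speech included, regions where several people talk at once contribute to both the errors and the reference total.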
best for
- Transcribing multi-speaker conference calls
- Analyzing customer support recordings
- Segmenting podcast guests by speaker
FAQ
The pipeline accepts mono audio sampled at 16 kHz. Stereo or multi-channel files are downmixed to mono, and files with a different sample rate are resampled to 16 kHz.
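The pipeline handles this conversion itself, so the following is only a sketch of what downmixing and resampling amount to. It assumes a downmix is a plain per-sample average across channels, and that audio loaded in memory can be passed to the pipeline as a dictionary with waveform and sample_rate keys. The helper names are hypothetical.

```python
from typing import List


def downmix(channels: List[List[float]]) -> List[float]:
    """Average equally long channels sample by sample into one mono channel."""
    if not channels:
        raise ValueError("at least one channel is required")
    return [sum(samples) / len(channels) for samples in zip(*channels)]


def load_as_mono_16k(path: str) -> dict:
    """Hypothetical helper: do the same conversion by hand with torchaudio."""
    # Imported lazily so downmix() works without torchaudio installed.
    import torchaudio

    waveform, sample_rate = torchaudio.load(path)   # shape: (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)   # downmix to mono
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    return {"waveform": waveform, "sample_rate": 16000}
```

Doing the conversion manually is rarely necessary; it is mainly useful when the audio already lives in memory rather than on disk.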
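If the number of speakers is known in advance, set num_speakers when calling the pipeline; otherwise, min_speakers and max_speakers can provide lower and upper bounds.
The pipeline runs on CPU by default. To run it on a GPU, send it to a CUDA device with pipeline.to(torch.device("cuda")).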
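A small sketch of device selection with a CPU fallback, for machines where CUDA may not be present. The helper names are hypothetical; the only calls assumed from the libraries are torch.cuda.is_available(), torch.device(), and the pipeline's to() method mentioned above.

```python
def pick_device_name(cuda_available: bool) -> str:
    """Prefer the GPU when one is available, otherwise stay on CPU."""
    return "cuda" if cuda_available else "cpu"


def move_to_best_device(pipeline):
    """Hypothetical helper: send an already loaded pipeline to the chosen device."""
    # Imported lazily so pick_device_name() works without PyTorch installed.
    import torch

    device = torch.device(pick_device_name(torch.cuda.is_available()))
    pipeline.to(device)
    return pipeline
```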
Call the gigarouter OpenAI-compatible endpoint with your API key and the model name "pyannote/speaker-diarization-3.1", sending an audio file as input.
The pipeline returns a pyannote Annotation instance with speaker turns and can be exported to RTTM format using write_rttm().
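As a sketch of how the output might be consumed, the helpers below list speaker turns from an Annotation, write it to an RTTM file, and read RTTM text back into plain tuples. The function names are hypothetical, and the parser assumes the usual RTTM layout in which the fourth, fifth, and eighth fields hold onset, duration, and speaker label.

```python
from typing import List, Tuple

Turn = Tuple[float, float, str]  # (start seconds, end seconds, speaker label)


def turns_from_annotation(diarization) -> List[Turn]:
    """List speaker turns from a pyannote Annotation (or a look-alike object)."""
    return [
        (segment.start, segment.end, speaker)
        for segment, _, speaker in diarization.itertracks(yield_label=True)
    ]


def save_rttm(diarization, path: str) -> None:
    """Write the annotation to disk in RTTM format."""
    with open(path, "w") as rttm_file:
        diarization.write_rttm(rttm_file)


def parse_rttm(text: str) -> List[Turn]:
    """Read SPEAKER lines from RTTM text, skipping anything else."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start, duration = float(fields[3]), float(fields[4])
        turns.append((start, start + duration, fields[7]))
    return turns
```

Plain tuples like these are convenient for aligning speaker turns with transcript timestamps in downstream code.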
We're benchmarking and onboarding Speaker Diarization 3.1 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.