Speaker Diarization 3.1

pyannote/speaker-diarization-3.1

published Nov 2023 · updated May 2024

Speaker Diarization 3.1 is a speaker diarization pipeline that identifies "who spoke when" in an audio recording using pure PyTorch for segmentation and embedding.

status

coming soon

API providers

downloads / mo

8.2M

license

MIT

specs

Task	Speaker Diarization
Architecture	PyTorch pipeline with speaker segmentation and embedding models
License	MIT

about this model

pyannote/speaker-diarization-3.1 is a speaker diarization pipeline that identifies "who spoke when" in mono audio sampled at 16 kHz. It combines a neural speaker segmentation model and a speaker embedding model, both running in pure PyTorch with no ONNX Runtime dependency, to produce an annotation of speaker turns.

Key characteristics

Fully automatic diarization: no manual voice activity detection and no need to specify the number of speakers in advance (num_speakers, min_speakers, and max_speakers are accepted as optional parameters).
Automatically downmixes stereo and multi-channel audio to mono and resamples to 16 kHz on load.
Pure PyTorch execution simplifies deployment and may speed up inference compared to version 3.0, which relied on ONNX Runtime.

Benchmark results

Diarization error rate (DER, lower is better) under the most demanding evaluation setup (no forgiveness collar, overlapped speech included) on nine benchmark configurations drawn from public datasets:

Dataset	DER (%)
AISHELL-4	12.2
AliMeeting (channel 1)	24.4
AMI (headset mix, only_words)	18.8
AMI (array1, channel 1, only_words)	22.4
AVA-AVD	50.0
DIHARD 3 (Full)	21.7
MSDWild	25.3
REPERE (phase 2)	7.8
VoxConverse (v0.3)	11.3

Additional benchmarks (CALLHOME part 2, Ego4D dev., RAMC) are reported in the pyannote.audio repository. The pipeline is released under the MIT license.
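
For readers unfamiliar with the metric, DER is conventionally defined as the sum of false alarm, missed detection, and speaker confusion durations, divided by the total duration of reference speech. The snippet below is a minimal sketch of that arithmetic. The function name and the durations are illustrative assumptions and do not come from any of the datasets listed above.

```python
def diarization_error_rate(false_alarm: float, missed: float,
                           confusion: float, total_speech: float) -> float:
    """Return DER as a percentage.

    All arguments are durations in seconds:
    false_alarm  - non-speech labeled as speech
    missed       - reference speech that was not detected
    confusion    - speech attributed to the wrong speaker
    total_speech - total reference speech duration
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return 100.0 * (false_alarm + missed + confusion) / total_speech


# Hypothetical error durations for a recording with 600 s of reference speech.
der = diarization_error_rate(false_alarm=12.0, missed=30.0,
                             confusion=18.0, total_speech=600.0)
print(f"DER = {der:.1f}%")  # DER = 10.0%
```

Because false alarms are not bounded by the amount of reference speech, DER can exceed 100% on difficult material.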

best for

·Attributing speech to speakers in multi-speaker conference calls
·Analyzing customer support recordings
·Segmenting podcast recordings by speaker

FAQ

What input formats does this model accept?

The pipeline accepts mono audio sampled at 16 kHz. Stereo or multi-channel files are downmixed to mono, and files with a different sample rate are resampled to 16 kHz.
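
To make "downmixed to mono" concrete, here is a small sketch that averages channels sample by sample, which is the usual way to downmix. It uses plain Python lists and a hypothetical helper name, downmix_to_mono; the pipeline itself does this on tensors when loading the file. Resampling is left out because it needs a proper filter rather than simple arithmetic.

```python
from typing import List, Sequence


def downmix_to_mono(channels: Sequence[Sequence[float]]) -> List[float]:
    """Average equally long channels into a single mono channel."""
    if not channels:
        raise ValueError("at least one channel is required")
    length = len(channels[0])
    if any(len(channel) != length for channel in channels):
        raise ValueError("all channels must have the same number of samples")
    # zip(*channels) yields one tuple per time step, with one sample per channel.
    return [sum(samples) / len(channels) for samples in zip(*channels)]


# Illustrative stereo samples (not real audio).
left = [0.5, -0.25, 1.0]
right = [0.0, 0.25, 0.5]
mono = downmix_to_mono([left, right])
print(mono)  # [0.25, 0.0, 0.75]
```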

Can I specify the number of speakers?

Yes, you can set num_speakers, min_speakers, or max_speakers when calling the pipeline.
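
A minimal sketch of passing these parameters, assuming pyannote.audio 3.1 (where Pipeline.from_pretrained takes use_auth_token; other releases may name this argument differently) and a Hugging Face token with access to the model. The helpers speaker_options and diarize are hypothetical names introduced here. The rule that rejects mixing num_speakers with a min/max range is a convention of this sketch, not a documented pipeline requirement.

```python
from typing import Optional


def speaker_options(num_speakers: Optional[int] = None,
                    min_speakers: Optional[int] = None,
                    max_speakers: Optional[int] = None) -> dict:
    """Build keyword arguments for the pipeline call, dropping unset values."""
    if num_speakers is not None and (min_speakers is not None
                                     or max_speakers is not None):
        raise ValueError("pass either num_speakers or a min/max range")
    if (min_speakers is not None and max_speakers is not None
            and min_speakers > max_speakers):
        raise ValueError("min_speakers must not exceed max_speakers")
    options = {"num_speakers": num_speakers,
               "min_speakers": min_speakers,
               "max_speakers": max_speakers}
    return {name: value for name, value in options.items() if value is not None}


def diarize(audio_path: str, hf_token: str, **speaker_kwargs):
    """Run the pipeline on one file (requires pyannote.audio and model access)."""
    # Imported lazily so speaker_options works without pyannote.audio installed.
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
                                        use_auth_token=hf_token)
    return pipeline(audio_path, **speaker_options(**speaker_kwargs))


print(speaker_options(min_speakers=2, max_speakers=5))
# {'min_speakers': 2, 'max_speakers': 5}
```

With this in place, a call such as diarize("audio.wav", hf_token, num_speakers=2) would fix the speaker count, while omitting all three parameters leaves the count to the pipeline.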

Does it run on GPU?

Yes, you can send the pipeline to a CUDA device with pipeline.to(torch.device("cuda")).
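
In scripts that may run on machines without a GPU, it can help to fall back to CPU. The sketch below assumes PyTorch is installed and that pipeline is an already loaded pyannote pipeline; choose_device_name and move_pipeline are hypothetical helper names.

```python
def choose_device_name(cuda_available: bool) -> str:
    """Return "cuda" when a GPU is usable, otherwise "cpu"."""
    return "cuda" if cuda_available else "cpu"


def move_pipeline(pipeline):
    """Send a loaded pipeline to the best available device and return that device."""
    # Imported lazily so choose_device_name can be used without PyTorch.
    import torch

    device = torch.device(choose_device_name(torch.cuda.is_available()))
    # The pipeline expects a torch.device instance here, not a plain string.
    pipeline.to(device)
    return device
```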

How do I use this model via the gigarouter API?

Hosted access is not yet live. Once it is available, call the gigarouter OpenAI-compatible endpoint with your API key and the model name "pyannote/speaker-diarization-3.1", sending an audio file as input.

What is the output format?

The pipeline returns a pyannote Annotation instance with speaker turns and can be exported to RTTM format using write_rttm().
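
RTTM is a plain-text format with one speaker turn per line, where the fourth and fifth fields hold the turn onset and duration in seconds and the eighth field holds the speaker label. The sketch below reads such a file back using only the standard library. The Turn type, the parse_rttm helper, and the sample lines are illustrative assumptions, not pyannote API or real pipeline output.

```python
from typing import List, NamedTuple


class Turn(NamedTuple):
    file_id: str
    start: float  # seconds
    end: float    # seconds
    speaker: str


def parse_rttm(text: str) -> List[Turn]:
    """Parse SPEAKER lines of an RTTM document into Turn records."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and any record type other than SPEAKER.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start = float(fields[3])
        duration = float(fields[4])
        turns.append(Turn(fields[1], start, start + duration, fields[7]))
    return turns


# A file like this could be produced with:
#     with open("audio.rttm", "w") as rttm:
#         diarization.write_rttm(rttm)
sample = """\
SPEAKER meeting 1 0.000 2.500 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER meeting 1 2.500 1.250 <NA> <NA> SPEAKER_01 <NA> <NA>
"""

turns = parse_rttm(sample)
for turn in turns:
    print(f"{turn.start:.2f}s - {turn.end:.2f}s  {turn.speaker}")
```

Keeping turns as start and end times rather than onset and duration makes it easier to align them with word timestamps from a separate transcription step.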

not yet live

We're benchmarking and onboarding Speaker Diarization 3.1 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
