Awesome Speaker Diarization

Overview
Publications
Software
Datasets
- Diarization datasets
- Speaker embedding training sets
Leaderboards
Other learning materials
- Tech blogs
- Video tutorials
Products

Overview

This is a curated list of awesome Speaker Diarization papers, libraries, datasets, and other resources.

The purpose of this repo is to organize the world’s resources for speaker diarization, and make them universally accessible and useful.

To add items to this page, simply send a pull request (see the contributing guide).

Publications

Special topics

Supervised diarization

Joint diarization and ASR

Challenges

Other

2020

Speaker Diarization with Region Proposal Network

2019

2018

2017

2016

A Speaker Diarization System for Studying Peer-Led Team Learning Groups

2015

Diarization resegmentation in the factor analysis subspace

2014

2013

Unsupervised methods for speaker diarization: An integrated and iterative approach

2011

2006

Software

Framework

Link	Language	Description
SIDEKIT for diarization (s4d)	Python	An open-source extension of SIDEKIT for speaker diarization.
pyAudioAnalysis	Python	Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications.
AaltoASR	Python & Perl	Speaker diarization scripts, based on AaltoASR.
LIUM SpkDiarization	Java	LIUM_SpkDiarization is software dedicated to speaker diarization (i.e. speaker segmentation and clustering). It is written in Java and includes the most recent developments in the domain (as of 2013).
kaldi-asr	Bash	Example scripts for speaker diarization on a portion of CALLHOME used in the 2000 NIST speaker recognition evaluation.
Alize LIA_SpkSeg	C++	ALIZÉ is an open-source platform for speaker recognition. LIA_SpkSeg is its set of tools for speaker diarization.
pyannote-audio	Python	Neural building blocks for speaker diarization: speech activity detection, speaker change detection, speaker embedding.
pyBK	Python	Speaker diarization using binary key speaker modelling. Computationally light solution that does not require external training data.
Speaker-Diarization	Python	Speaker diarization using uis-rnn and GhostVLAD. An easier way to support open-set speakers.
EEND	Python & Bash & Perl	End-to-End Neural Diarization.
VBDiarization	Python	Speaker diarization based on Kaldi x-vectors, using a pretrained model trained in Kaldi (kaldi-asr/kaldi), converted to ONNX format (onnx/onnx), and run in ONNXRuntime (Microsoft/onnxruntime).
RE-VERB	Python & JavaScript	RE: VERB is a speaker diarization system. It lets the user send or record audio of a conversation and receive timestamps of who spoke when.

Evaluation

Link	Language	Description
pyannote-metrics	Python	A toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems.
SimpleDER	Python	A lightweight library to compute Diarization Error Rate (DER).
NIST md-eval	Perl	(1) modified md-eval.pl from Mary Tai Knox; (2) md-eval-v21.pl from jitendra; (3) md-eval-22.pl from nryant
dscore	Python & Perl	Diarization scoring tools.
Sequence Match Accuracy	Python	Match the accuracy of two sequences with Hungarian algorithm.
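
To illustrate what a sequence-matching metric measures, here is a minimal sketch (not the code of any library listed above). Diarization labels are arbitrary, so the hypothesis labels must be mapped one-to-one onto the reference labels before counting agreements. The function name sequence_match_accuracy is hypothetical, and for brevity the sketch searches all label permutations exhaustively instead of using the Hungarian algorithm, so it is only practical for a handful of speakers.

```python
from itertools import permutations

# Sentinel used to pad the smaller label set (assumption of this sketch).
_UNMATCHED = object()


def sequence_match_accuracy(reference, hypothesis):
    """Best fraction of agreeing positions over all one-to-one label mappings."""
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must have the same length")
    if not reference:
        return 1.0

    # Unique labels, in order of first appearance.
    ref_labels = list(dict.fromkeys(reference))
    hyp_labels = list(dict.fromkeys(hypothesis))

    # Co-occurrence counts for every (reference label, hypothesis label) pair.
    counts = {}
    for r, h in zip(reference, hypothesis):
        counts[(r, h)] = counts.get((r, h), 0) + 1

    # Pad both sides to the same size so every one-to-one mapping is covered.
    size = max(len(ref_labels), len(hyp_labels))
    padded_ref = ref_labels + [_UNMATCHED] * (size - len(ref_labels))
    padded_hyp = hyp_labels + [_UNMATCHED] * (size - len(hyp_labels))

    # Exhaustive search; the Hungarian algorithm solves the same assignment
    # problem in polynomial time.
    best = 0
    for perm in permutations(padded_ref):
        matched = sum(counts.get((r, h), 0) for r, h in zip(perm, padded_hyp))
        best = max(best, matched)
    return best / len(reference)
```

For example, sequence_match_accuracy([0, 0, 1, 1, 2], ["a", "a", "b", "b", "b"]) maps "a" to 0 and "b" to 1, so four of the five positions agree.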

Clustering

Link	Language	Description
uis-rnn	Python & PyTorch	Google's Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, for Fully Supervised Speaker Diarization. This clustering algorithm is supervised.
uis-rnn-sml	Python & PyTorch	A variant of UIS-RNN, for the paper Supervised Online Diarization with Sample Mean Loss for Multi-Domain Data.
DNC	Python & ESPnet	Transformer-based Discriminative Neural Clustering (DNC) for Speaker Diarisation. Like UIS-RNN, it is supervised.
SpectralCluster	Python	Spectral clustering with affinity matrix refinement operations.
sklearn.cluster	Python	scikit-learn clustering algorithms.
PLDA	Python	Probabilistic Linear Discriminant Analysis & classification, written in Python.
PLDA	C++	Open-source implementation of simplified PLDA (Probabilistic Linear Discriminant Analysis).
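
As a rough illustration of the unsupervised clustering step these libraries address, the following sketch groups per-segment speaker embeddings with average-linkage agglomerative clustering on cosine similarity. It is a pure-Python toy, not the algorithm of any package listed above. The function names and the 0.5 stopping threshold are assumptions made for the example, and a real system would tune the threshold on held-out data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def agglomerative_cluster(embeddings, threshold=0.5):
    """Merge the most similar clusters until no pair reaches the threshold.

    Returns one integer speaker label per embedding.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(a, b):
        # Average linkage: mean pairwise similarity between two clusters.
        sims = [cosine(embeddings[i], embeddings[j]) for i in a for j in b]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_sim, best_pair = -2.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = linkage(clusters[x], clusters[y])
                if s > best_sim:
                    best_sim, best_pair = s, (x, y)
        if best_sim < threshold:
            break  # Remaining clusters are treated as distinct speakers.
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

With four toy two-dimensional embeddings, agglomerative_cluster([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]) places the first two segments in one cluster and the last two in another.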

Speaker embedding

Link	Method	Language	Description
resemble-ai/Resemblyzer	d-vector	Python & PyTorch	PyTorch implementation of generalized end-to-end loss for speaker verification, which can be used for voice cloning and diarization.
Speaker_Verification	d-vector	Python & TensorFlow	TensorFlow implementation of generalized end-to-end loss for speaker verification.
PyTorch_Speaker_Verification	d-vector	Python & PyTorch	PyTorch implementation of "Generalized End-to-End Loss for Speaker Verification" by Wan, Li et al. With UIS-RNN integration.
Real-Time Voice Cloning	d-vector	Python & PyTorch	Implementation of "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis" (SV2TTS) with a vocoder that works in real-time.
deep-speaker	d-vector	Python & Keras	Third party implementation of the Baidu paper Deep Speaker: an End-to-End Neural Speaker Embedding System.
x-vector-kaldi-tf	x-vector	Python & TensorFlow & Perl	TensorFlow implementation of the x-vector topology on top of a Kaldi recipe.
kaldi-ivector	i-vector	C++ & Perl	Extension to Kaldi implementing the standard i-vector hyperparameter estimation and i-vector extraction procedure.
voxceleb-ivector	i-vector	Perl	Voxceleb1 i-vector based speaker recognition system.

Speaker change detection

Link	Language	Description
change_detection	Python & Keras	Code for Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks.

Other software

Link	Language	Description
VB Diarization	Python	VB Diarization with Eigenvoice and HMM Priors.

Datasets

Diarization datasets

Audio	Diarization ground truth	Language	Pricing	Additional information
2000 NIST Speaker Recognition Evaluation	Disk-6 (Switchboard), Disk-8 (CALLHOME)	Multiple	$2400.00	Evaluation Plan
2003 NIST Rich Transcription Evaluation Data	Included with the audio	en, ar, zh	$2000.00	telephone speech, broadcast news
CALLHOME American English Speech	CALLHOME American English Transcripts	en	$1500.00 + $1000.00	CH109 whitelist
The ICSI Meeting Corpus	Included with the audio	en	Free	License
The AMI Meeting Corpus	Included with the audio (needs to be processed)	Multiple	Free	License
Fisher English Training Speech Part 1 Speech	Fisher English Training Speech Part 1 Transcripts	en	$7000.00 + $1000.00
Fisher English Training Part 2, Speech	Fisher English Training Part 2, Transcripts	en	$7000.00 + $1000.00

Speaker embedding training sets

Name	Utterances	Speakers	Language	Pricing	Additional information
TIMIT	6K+	630	en	$250.00	Published in 1993, the TIMIT corpus of read speech is one of the earliest speaker recognition datasets.
VCTK	43K+	109	en	Free	Most sentences were selected from a newspaper, plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent.
LibriSpeech	292K	2K+	en	Free	Large-scale (1000 hours) corpus of read English speech.
LibriVox	180K	9K+	Multiple	Free	Free public domain audiobooks. LibriSpeech is a processed subset of LibriVox. Each original unsegmented utterance could be very long.
VoxCeleb 1&2	1M+	7K	Multiple	Free	VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube.
The Spoken Wikipedia Corpora	5K	879	en, de, nl	Free	Volunteer readers reading Wikipedia articles.
CN-Celeb	130K+	1K	zh	Free	A Free Chinese Speaker Recognition Corpus Released by CSLT@Tsinghua University.
DeepMine	540K	1850	fa, en	Unknown	A speech database in Persian and English designed to build and evaluate speaker verification, as well as Persian ASR systems.

Leaderboards

Other learning materials

Tech blogs

Video tutorials

Google's Diarization System: Speaker Diarization with LSTM by Google
Fully Supervised Speaker Diarization: Say Goodbye to clustering by Google
Speaker Diarization: Optimal Clustering and Learning Speaker Embeddings by Microsoft Research
Robust Speaker Diarization for Meetings: the ICSI system by Microsoft Research

Products

Company	Product
Google	Google Cloud Speech-to-Text API
Amazon	Amazon Transcribe
IBM	Watson Speech To Text API
DeepAffects	Speaker Diarization API
