Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

Neural speaker diarization with pyannote.audio

pyannote.audio is an open-source toolkit written in Python for speaker diarization. Based on the PyTorch machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines.

TL;DR

# instantiate pretrained speaker diarization pipeline
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")

# apply pretrained pipeline
diarization = pipeline("audio.wav")

# print the result
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")
# start=0.2s stop=1.5s speaker_A
# start=1.8s stop=3.9s speaker_B
# start=4.2s stop=5.7s speaker_A
# ...
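
As a follow-up to the snippet above, here is a minimal sketch of one common post-processing step: summing speaking time per speaker. It assumes the turns have already been collected as plain `(start, end, speaker)` tuples, for instance from `turn.start`, `turn.end` and `speaker` in the loop above. The `turns` list and the `speaking_time` helper are illustrative names, not part of pyannote.audio.

```python
from collections import defaultdict

# Hypothetical turns copied from the sample output above: (start, end, speaker).
turns = [
    (0.2, 1.5, "A"),
    (1.8, 3.9, "B"),
    (4.2, 5.7, "A"),
]


def speaking_time(turns):
    """Sum the duration of each speaker's turns, in seconds."""
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)


for speaker, seconds in sorted(speaking_time(turns).items()):
    print(f"speaker_{speaker}: {seconds:.1f}s")
# speaker_A: 2.8s
# speaker_B: 2.1s
```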

What's new in pyannote.audio 2.0

For version 2.0 of pyannote.audio, I decided to rewrite almost everything from scratch.
Highlights of this release are:

Installation

Only Python 3.8+ is officially supported (though it might work with Python 3.7)

conda create -n pyannote python=3.8
conda activate pyannote
conda install pytorch torchaudio -c pytorch
pip install https://github.com/pyannote/pyannote-audio/archive/develop.zip
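
To check an existing environment against the version note above, a small sketch such as the following can be run inside it. The `is_officially_supported` helper is an illustrative name and the 3.8 threshold is taken from the note, not read from the package metadata.

```python
import sys

MINIMUM_SUPPORTED = (3, 8)  # assumption: taken from the installation note


def is_officially_supported(version_info=sys.version_info):
    """Return True when the interpreter's major.minor is at least 3.8."""
    return tuple(version_info[:2]) >= MINIMUM_SUPPORTED


if not is_officially_supported():
    print("Warning: this Python version is not officially supported.")
```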

Documentation

Frequently asked questions

Pretrained pipelines do not produce good results on my data. What can I do?

  1. Annotate dozens of conversations manually and separate them into development and test subsets in pyannote.database.
  2. Optimize the hyper-parameters of the pretrained pipeline using the development set. If performance is still not good enough, go to step 3.
  3. Annotate hundreds of conversations manually and set them up as a training subset in pyannote.database.
  4. Fine-tune the models (on which the pipeline relies) using the training set.
  5. Using the development set, optimize the hyper-parameters of the pipeline built on the fine-tuned models. If performance is still not good enough, go back to step 3.
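
The hyper-parameter optimization in steps 2 and 5 can be pictured as a search over candidate values scored on the development set. The sketch below is a toy grid search over a single threshold, not the optimizer pyannote actually uses. `tune_threshold`, `toy_error_rate` and the file names are hypothetical, and the error function stands in for "run the pipeline on a file and score it against the manual annotation".

```python
from typing import Callable, Iterable, Sequence, Tuple


def tune_threshold(
    candidates: Sequence[float],
    dev_files: Iterable[str],
    error_rate: Callable[[float, str], float],
) -> Tuple[float, float]:
    """Return (best_threshold, mean_error) measured on the development files."""
    dev_files = list(dev_files)
    if not dev_files or not candidates:
        raise ValueError("need at least one candidate and one development file")
    best = None
    for threshold in candidates:
        mean_error = sum(error_rate(threshold, f) for f in dev_files) / len(dev_files)
        if best is None or mean_error < best[1]:
            best = (threshold, mean_error)
    return best


def toy_error_rate(threshold: float, file: str) -> float:
    """Toy stand-in for a real evaluation: error is lowest at threshold 0.5."""
    return abs(threshold - 0.5) + 0.1


best_threshold, best_error = tune_threshold(
    [0.3, 0.4, 0.5, 0.6], ["a.wav", "b.wav"], toy_error_rate
)
print(best_threshold)
```

The test subset is deliberately left out of the search, so that it still gives an unbiased estimate once the hyper-parameters are fixed.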

Benchmark

The pretrained speaker diarization pipeline with default parameters is expected to be much better in v2.0 than in v1.1:

Diarization error rate (%)          v1.1   v2.0   ∆DER
AMI only_words evaluation set       29.7   21.5   -28%
DIHARD 3 evaluation set             29.2   22.2   -23%
VoxConverse 0.0.2 evaluation set    21.5   12.8   -40%

Here is the (pseudo-)code used to obtain those numbers:

# v1.1
import torch
pipeline = torch.hub.load("pyannote/pyannote-audio", "dia")
diarization = pipeline({"audio": "audio.wav"})

# v2.0
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("audio.wav")

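# evaluation (no forgiveness collar, overlapped speech included)
from pyannote.metrics.diarization import DiarizationErrorRate
metric = DiarizationErrorRate(collar=0.0, skip_overlap=False)
for audio, reference in evaluation_set:  # pseudo-code: iterate over (audio file, reference annotation) pairs
    diarization = pipeline(audio)
    _ = metric(reference, diarization)  # per-file value is discarded; the metric accumulates internally
der = abs(metric)  # diarization error rate aggregated over the whole evaluation set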
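
For readers unfamiliar with the metric, the sketch below spells out the arithmetic in plain Python. It uses the standard definition of diarization error rate (false alarm, missed detection and speaker confusion durations divided by total reference speech). It assumes that aggregation pools durations over files before dividing, and that the ∆DER column is the relative change with respect to v1.1. The helper names and the per-file durations are made up for illustration. The published percentages may have been computed from unrounded error rates, so recomputing them from the rounded table values can differ by a point.

```python
def diarization_error_rate(false_alarm, missed_detection, confusion, total):
    """DER = (false alarm + missed detection + confusion) / total reference speech."""
    return (false_alarm + missed_detection + confusion) / total


def aggregate_der(per_file):
    """Pool durations over all files before dividing, so longer files weigh more."""
    keys = ("false_alarm", "missed_detection", "confusion", "total")
    sums = {key: sum(f[key] for f in per_file) for key in keys}
    return diarization_error_rate(**sums)


def relative_change(old, new):
    """Relative change in percent from old to new (negative means improvement)."""
    return 100.0 * (new - old) / old


# Made-up error durations in seconds for two files.
per_file = [
    {"false_alarm": 1.0, "missed_detection": 2.0, "confusion": 1.0, "total": 20.0},
    {"false_alarm": 0.5, "missed_detection": 0.5, "confusion": 1.0, "total": 20.0},
]
print(f"{100 * aggregate_der(per_file):.1f}%")   # 15.0%
print(f"{relative_change(29.7, 21.5):.0f}%")     # AMI row: -28%
```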

Support

For commercial enquiries and scientific consulting, please contact me.

Development

The commands below will set up the pre-commit hooks and packages needed for developing the pyannote.audio library.

pip install -e .[dev,testing]
pre-commit install

Tests rely on a set of debugging files available in the tests/data directory. Set the PYANNOTE_DATABASE_CONFIG environment variable to tests/data/database.yml before running tests:

PYANNOTE_DATABASE_CONFIG=tests/data/database.yml pytest
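
When the tests are launched from a Python script rather than a shell, the same variable can be passed through the subprocess environment. This is a minimal sketch: `build_env` and `run_tests` are illustrative helpers, and the default path simply mirrors the command above.

```python
import os
import subprocess
import sys


def build_env(config_path="tests/data/database.yml", base_env=None):
    """Copy the environment and point PYANNOTE_DATABASE_CONFIG at the test config."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYANNOTE_DATABASE_CONFIG"] = config_path
    return env


def run_tests():
    """Run pytest in a subprocess with the variable set; return its exit code."""
    return subprocess.run([sys.executable, "-m", "pytest"], env=build_env()).returncode
```

Call `run_tests()` from the repository root so that the relative path resolves.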