This repository contains the Cog definition files for the associated speaker diarization model deployed on Replicate.
This model receives an audio file and identifies the individual speakers within the recording. The output is a list of annotated speech segments, along with global information about the number of detected speakers and an embedding vector for each speaker that characterizes their voice.
The model is based on a pre-trained speaker diarization pipeline from the `pyannote.audio` package, with a post-processing layer that cleans up the output segments and computes input-wide speaker embeddings.
`pyannote.audio` is an open-source toolkit written in Python for speaker diarization based on PyTorch. It provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines.
The main pipeline makes use of:

- `pyannote/segmentation` for permutation-invariant speaker segmentation on temporal slices
- `speechbrain/spkrec-ecapa-voxceleb` for generating speaker embeddings
- `AgglomerativeClustering` for matching embeddings across temporal slices
See this post (written by the `pyannote.audio` author) for more details.
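For reference, here is a minimal sketch of how such a pretrained pipeline can be invoked with `pyannote.audio`. The pipeline name and the printing logic are illustrative; this is not the exact code deployed on Replicate.

```python
# Illustrative sketch: run a pretrained pyannote.audio diarization pipeline
# on a local file. Loading the pipeline may require a Hugging Face access token.
from pyannote.audio import Pipeline

# Load the pretrained diarization pipeline (name shown for illustration).
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")

# Run diarization on a local audio file.
diarization = pipeline("audio.wav")

# Iterate over speech turns and their speaker labels.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.2f}s -> {turn.end:.2f}s")
```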
Starting from version `64b78c82`, the model now uses `ffmpeg` to decode the input audio, so it supports a wide variety of input formats, including but not limited to `mp3`, `aac`, `flac`, `ogg`, `opus`, and `wav`.
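For illustration only, decoding could look roughly like the sketch below. The exact `ffmpeg` invocation used by the model is not documented here; the 16 kHz mono target is an assumption.

```python
# Hypothetical helper: decode any ffmpeg-supported audio file to 16 kHz mono
# PCM WAV before diarization. Not the model's actual code.
import subprocess

def decode_to_wav(src_path: str, dst_path: str = "decoded.wav") -> str:
    subprocess.run(
        [
            "ffmpeg", "-y",   # overwrite the output file if it exists
            "-i", src_path,   # input in any supported format (mp3, aac, flac, ...)
            "-ac", "1",       # downmix to mono
            "-ar", "16000",   # resample to 16 kHz
            dst_path,
        ],
        check=True,
    )
    return dst_path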
The model outputs a single `output.json` file with the following structure:
{
  "segments": [
    {
      "speaker": "A",
      "start": "0:00:00.497812",
      "stop": "0:00:09.779063"
    },
    {
      "speaker": "B",
      "start": "0:00:09.863438",
      "stop": "0:03:34.962188"
    }
  ],
  "speakers": {
    "count": 2,
    "labels": [
      "A",
      "B"
    ],
    "embeddings": {
      "A": [<array of 192 floats>],
      "B": [<array of 192 floats>]
    }
  }
}
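As a usage sketch (not part of the model), the file can be consumed like this once downloaded locally; the timestamp parser assumes the `H:MM:SS.ffffff` strings shown above.

```python
# Parse output.json and print each segment with its duration.
import json

def parse_timestamp(ts: str) -> float:
    """Convert an 'H:MM:SS.ffffff' string to seconds."""
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

with open("output.json") as f:
    result = json.load(f)

print(f"Detected {result['speakers']['count']} speakers")
for segment in result["segments"]:
    start = parse_timestamp(segment["start"])
    stop = parse_timestamp(segment["stop"])
    print(f"Speaker {segment['speaker']}: {start:.2f}s -> {stop:.2f}s "
          f"({stop - start:.2f}s)")
```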
The current T4 deployment has an average processing speed factor of 12x relative to the length of the audio input, e.g. it takes the model approximately 1 minute of computation to process 12 minutes of audio.
Data augmentation and segmentation for a variety of transcription and captioning tasks (e.g. interviews, podcasts, meeting recordings, etc.). Speaker recognition can be implemented by matching the speaker embeddings against a database of known speakers.
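As a rough sketch of the speaker-recognition idea, the 192-dimensional embeddings from `output.json` could be matched against stored reference embeddings with cosine similarity. The database format and the `0.7` threshold below are assumptions for illustration.

```python
# Hypothetical example: identify a diarized speaker by comparing their
# embedding against a dictionary of known reference embeddings.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(embedding, known_speakers: dict, threshold: float = 0.7):
    """Return the best-matching known speaker, or None if below the threshold."""
    embedding = np.asarray(embedding)
    best_name, best_score = None, threshold
    for name, reference in known_speakers.items():
        score = cosine_similarity(embedding, np.asarray(reference))
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```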
This model may have biases based on the data it has been trained on. It is important to use the model in a responsible manner and adhere to ethical and legal standards.
If you use `pyannote.audio`, please use the following citations:
@inproceedings{Bredin2020,
Title = {{pyannote.audio: neural building blocks for speaker diarization}},
Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
Year = {2020},
}
@inproceedings{Bredin2021,
Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
Booktitle = {Proc. Interspeech 2021},
Year = {2021},
}