Pyannote audio diarization in Rust
- Process 1 hour of audio in under a minute on CPU.
- Faster performance with DirectML on Windows and CoreML on macOS.
- Accurate timestamps with Pyannote segmentation.
- Identify speakers with wespeaker embeddings.
cargo add pyannote-rs
See Building
See examples
How it works
pyannote-rs uses two models for speaker diarization:
- Segmentation: segmentation-3.0 identifies when speech occurs.
- Speaker Identification: wespeaker-voxceleb-resnet34-LM identifies who is speaking.
Inference is powered by onnxruntime.
- The segmentation model processes audio in chunks of up to 10 s, using a sliding-window approach (iterating over the input in fixed-size windows).
- The embedding model processes filter banks (audio features) extracted with knf-rs.
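The sliding-window chunking above can be sketched in plain Rust. This is an illustrative stand-alone sketch, not pyannote-rs's actual API: the window length (10 s at 16 kHz) and the non-overlapping hop are assumptions made for the example.

```rust
// Sketch of sliding-window chunking for segmentation.
// Window size and hop are illustrative assumptions, not
// the crate's actual parameters.
fn chunks(samples: &[f32], window: usize, hop: usize) -> Vec<&[f32]> {
    let mut out = Vec::new();
    let mut start = 0;
    while start < samples.len() {
        let end = (start + window).min(samples.len());
        out.push(&samples[start..end]);
        if end == samples.len() {
            break;
        }
        start += hop;
    }
    out
}

fn main() {
    // 16 kHz mono audio: a 10 s window is 160_000 samples.
    let sample_rate = 16_000;
    let window = 10 * sample_rate;
    let samples = vec![0.0f32; 25 * sample_rate]; // 25 s of audio
    // Non-overlapping windows here; a real pipeline may overlap them.
    let windows = chunks(&samples, window, window);
    println!("{}", windows.len()); // 10 s + 10 s + 5 s
}
```

Each chunk is then fed to the segmentation model independently, and the resulting speech regions are mapped back to absolute timestamps using the chunk's start offset.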
Speaker comparison (e.g., determining if Alice spoke again) is done using cosine similarity.
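Cosine similarity between two embedding vectors can be sketched as below; the vectors and the idea of comparing a new segment against known speakers follow the text, while the toy 3-dimensional embeddings are made up for illustration (real wespeaker embeddings are much longer).

```rust
// Cosine similarity between two speaker embeddings: values near 1.0
// suggest the same speaker produced both segments.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Toy 3-dimensional embeddings, invented for the example.
    let alice_1 = [0.9f32, 0.1, 0.3];
    let alice_2 = [0.8f32, 0.2, 0.25];
    let bob = [-0.4f32, 0.7, -0.5];
    let same = cosine_similarity(&alice_1, &alice_2);
    let diff = cosine_similarity(&alice_1, &bob);
    // Embeddings of the same speaker score much higher than
    // embeddings of different speakers.
    println!("same: {:.2}, diff: {:.2}", same, diff);
}
```

In practice a threshold on this score (the exact value is a tuning choice, not specified here) decides whether a new segment is assigned to an existing speaker or starts a new one.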
Big thanks to pyannote-onnx and kaldi-native-fbank.