Speaker diarization based on the Python implementation from http://voicebiometry.org/
Dependencies are listed in requirements.txt.
An Anaconda environment (https://www.anaconda.com/download/) is recommended because the implementation relies on MKL.
Run python setup.py install
The configuration file declares the models to use and the relative paths to them. The preferred configuration file is configs/vbdiar.yml.
Pretrained models are stored in the models/ directory. Scoring is also possible without a PLDA model; see the configuration file configs/vbdiar_no-PLDA.yml.
The example script examples/diarization.py runs the full diarization process.
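A typical invocation might look like the sketch below. The flags are the ones documented in this README; which of them are mandatory is not stated here, so the selection is an assumption. The list file and directory names (`lists/test.txt`, `wav`, `vad`, `rttm`) are hypothetical placeholders, and the command is only assembled and printed, not executed.

```python
import shlex


def build_diarization_command(input_list, config, audio_dir, vad_dir, out_rttm_dir):
    """Assemble the argument vector for examples/diarization.py."""
    return [
        "python", "examples/diarization.py",
        "-l", input_list,
        "-c", config,
        "--audio-dir", audio_dir,
        "--vad-dir", vad_dir,
        "--out-rttm-dir", out_rttm_dir,
    ]


# Hypothetical paths; only configs/vbdiar.yml is taken from this README.
command = build_diarization_command(
    input_list="lists/test.txt",
    config="configs/vbdiar.yml",
    audio_dir="wav",
    vad_dir="vad",
    out_rttm_dir="rttm",
)

# Print the command as a copy-pasteable shell line.
print(shlex.join(command))
```

To actually run it, the same list can be passed to `subprocess.run(command, check=True)` from the repository root.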
'-l', '--input-list'
- specifies the list of relative paths to the files to be tested; the number of speakers may be given as a second column. Omit file suffixes: each path is interpreted relative to the relevant input directory, and the suffix is added automatically.
'-c', '--configuration'
- specifies configuration file
'--audio-dir'
- directory with audio files in .wav format (8000 Hz, 16-bit signed, 1 channel).
'--vad-dir'
- directory with lab files containing voice/speech activity detection output, in the format speech_start speech_end.
'--in-ivec-dir'
- input directory containing i-vectors (if they were previously saved).
'--out-ivec-dir'
- output directory for storing i-vectors.
'--norm-list'
- input list of files for score normalization. Score normalization requires ground-truth .rttm files with unique speaker labels. A label may repeat across files only when the same speaker really appears in more than one audio file. All normalization utterances are merged by speaker label.
'--in-rttm-dir'
- input directory with .rttm files (used primarily for score normalization).
'--out-rttm-dir'
- output directory for storing .rttm files.
'--min-window-size'
- minimal size of the i-vector window in milliseconds. Defines the minimal size used by the clustering algorithms.
'--max-window-size'
- maximal size of the i-vector window in milliseconds.
'--vad-tolerance'
- skip n frames of non-speech and merge them as speech.
'--max-num-speakers'
- maximal number of speakers. Used by the clustering algorithm.
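The input formats behind '--input-list', '--vad-dir' and '--vad-tolerance' can be illustrated with a minimal sketch. It is not the code used by this package: the helper names are hypothetical, columns are assumed to be whitespace-separated, the lab values and the tolerance are assumed to share one unit, and a gap is assumed to be merged when it is at most the tolerance.

```python
def parse_input_list(text):
    """Parse list lines of the form '<relative path> [<number of speakers>]'."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # The second column is optional; None means "not specified".
        num_speakers = int(fields[1]) if len(fields) > 1 else None
        entries.append((fields[0], num_speakers))
    return entries


def parse_lab(text):
    """Parse lab lines of the form 'speech_start speech_end'."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            segments.append((float(fields[0]), float(fields[1])))
    return segments


def merge_segments(segments, tolerance):
    """Merge speech segments separated by at most `tolerance` of non-speech."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= tolerance:
            # Gap is small enough: extend the previous segment over it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(segment) for segment in merged]


# Hypothetical file contents, for illustration only.
print(parse_input_list("meeting_a 4\nmeeting_b\n"))
print(merge_segments(parse_lab("0 100\n103 250\n400 520\n"), tolerance=5))
```

In the second call, the gap between 100 and 103 is within the tolerance, so the first two segments become one, while the gap between 250 and 400 is kept.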
Results on the AMI corpus http://groups.inf.ed.ac.uk/ami/corpus/ (development and evaluation sets):
System | DER |
---|---|
v64 + PLDA + Oracle number of speakers | 17.47 |
v64 + PLDA + Oracle number of speakers + S-Norm | 16.26 |
v64 + PLDA + S-Norm | 16.08 |
v64 + Cosine Scoring + S-Norm | 15.81 |
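For readers unfamiliar with the metric, the diarization error rate (DER) is conventionally the sum of missed speech, false-alarm speech and speaker-confusion time divided by the total reference speech time. The sketch below assumes the table reports DER in percent; this README does not state the scoring tool or settings behind the numbers, and the durations in the example are made up for illustration.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Return DER in percent from error durations given in the same time unit."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return 100.0 * (missed + false_alarm + confusion) / total_speech


# Illustrative durations in seconds, not taken from the experiments above.
der = diarization_error_rate(
    missed=30.0, false_alarm=20.0, confusion=50.0, total_speech=1000.0
)
print(f"DER = {der:.2f} %")  # DER = 10.00 %
```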