This software allows you to calibrate log likelihood ratio (LLR) scores for speaker verification evaluation.
Often, speaker verification systems are evaluated on the actual DCF scores or CLLR scores. In order to get good performance on these measures, LLR scores from a speaker verification system (e.g., from a PLDA model) need to be calibrated.
This software optimizes the CLLR measure of a single speaker verification system or of a combination of systems. Calibration is done by finding a linear transform of the scores that minimizes CLLR on held-out data.
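To make the idea concrete, here is a minimal sketch of CLLR and of fitting a linear transform `a * s + b` that lowers it. It is not the code in `calibrate_scores.py`: the function names, the toy scores, and the plain gradient-descent loop (with a step size picked for this toy data) are all assumptions made for illustration, and the sketch uses only the standard library rather than PyTorch.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def cllr(tgt, imp):
    """CLLR in bits; a system that always outputs LLR = 0 scores 1.0."""
    tgt_cost = sum(softplus(-s) for s in tgt) / len(tgt)
    imp_cost = sum(softplus(s) for s in imp) / len(imp)
    return 0.5 * (tgt_cost + imp_cost) / math.log(2)


def fit_affine(tgt, imp, steps=2000, lr=0.05):
    """Fit s -> a * s + b by gradient descent on CLLR (hypothetical helper)."""
    a, b = 1.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s in tgt:
            # d/dz softplus(-z) = -sigmoid(-z), with z = a * s + b
            g = -sigmoid(-(a * s + b)) / (2 * len(tgt))
            grad_a += g * s
            grad_b += g
        for s in imp:
            # d/dz softplus(z) = sigmoid(z)
            g = sigmoid(a * s + b) / (2 * len(imp))
            grad_a += g * s
            grad_b += g
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


# Hypothetical toy scores: overlapping classes, shifted towards positive values.
tgt_llrs = [8.0, 12.0, 6.0, 10.0, 3.0]
imp_llrs = [2.0, -1.0, 4.0, 0.0, 1.0, 7.0]

a, b = fit_affine(tgt_llrs, imp_llrs)
before = cllr(tgt_llrs, imp_llrs)
after = cllr([a * s + b for s in tgt_llrs], [a * s + b for s in imp_llrs])
print(f"a={a:.3f} b={b:.3f} CLLR before={before:.3f} after={after:.3f}")
```

Because the objective is convex in `a` and `b`, a small enough step size lowers CLLR at every iteration; a real implementation would more likely use an off-the-shelf optimizer.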
- Python 3+
- PyTorch 1.0 (no GPU needed)
In the sample directory, there are three files: sys1_llr.txt, sys2_llr.txt and trial-keys.txt. The first two contain (uncalibrated) LLR scores from two different systems on a held-out trial set, and the third file gives the oracle label for each trial: target (tgt) or non-target/impostor (imp).
In order to measure the DCF and CLLR scores, you need to download and extract a scoring tool from https://app.box.com/s/9tpuuycgxk9hykr6romsv05vvmdpie11/file/389271165078. It is an official scorer for the VOiCES from a Distance Challenge.
Let's first measure the accuracy of the uncalibrated scores:
$ python2 voices_scorer/score_voices sample/sys1_llr.txt sample/trial-keys.txt
minDCF : 0.4252
actDCF : 0.7496
avgRPrec : 0.6384
EER : 0.0547
Cllr : 0.9787
$ python2 voices_scorer/score_voices sample/sys2_llr.txt sample/trial-keys.txt
minDCF : 0.3034
actDCF : 1.8849
avgRPrec : 0.6979
EER : 0.0710
Cllr : 0.5986
The software can calibrate scores of one or more systems. Let's first calibrate the first system (sample/sys1_llr.txt). First, you have to find the parameters that optimize CLLR on the held-out data:
$ python calibrate_scores.py --save-model sample/sys1_calibration.pth sample/trial-keys.txt sample/sys1_llr.txt
Starting point for CLLR is 0.978737
STEP: 0
loss: 0.5246010594024472
[...]
loss: 0.18544731777635964
Converged!
Saving model to sample/sys1_calibration.pth
Next, you need to apply the calibration model:
$ python apply_calibration.py sample/sys1_calibration.pth sample/sys1_llr.txt sample/sys1_calibrated_llr.txt
Let's measure the performance of the calibrated system:
$ python2 voices_scorer/score_voices sample/sys1_calibrated_llr.txt sample/trial-keys.txt
minDCF : 0.4252
actDCF : 0.4320
avgRPrec : 0.6384
EER : 0.0547
Cllr : 0.1854
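As can be seen, the actDCF and Cllr scores are now much better than initially: actDCF drops from 0.7496 to 0.4320 and Cllr from 0.9787 to 0.1854, while minDCF and EER are unchanged.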
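The sketch below illustrates why calibration can change actDCF while minDCF and EER stay the same in the output above. Actual DCF makes decisions at a fixed Bayes threshold, so rescaling or shifting the scores moves trials across that threshold without changing their ranking. The cost parameters (`p_target=0.01`, `c_miss=c_fa=1`), the function name, and the toy scores are assumptions for illustration and may not match what the scorer uses.

```python
import math


def act_dcf(tgt_llrs, imp_llrs, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Normalized detection cost when accepting trials with LLR >= Bayes threshold."""
    threshold = math.log(c_fa * (1.0 - p_target) / (c_miss * p_target))
    p_miss = sum(s < threshold for s in tgt_llrs) / len(tgt_llrs)
    p_fa = sum(s >= threshold for s in imp_llrs) / len(imp_llrs)
    dcf = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
    # Normalize by the cost of the best system that ignores the scores.
    return dcf / min(c_miss * p_target, c_fa * (1.0 - p_target))


# Hypothetical toy scores: perfectly ranked, but too small in magnitude.
tgt_llrs = [1.0, 2.0, 3.0, 4.0]
imp_llrs = [-3.0, -2.0, -1.0, 0.5]

raw = act_dcf(tgt_llrs, imp_llrs)
# A linear transform with positive scale keeps the ranking but moves scores
# relative to the fixed threshold (about 4.6 for these parameters).
scaled = act_dcf([3.0 * s for s in tgt_llrs], [3.0 * s for s in imp_llrs])
print(f"actDCF raw={raw:.2f} scaled={scaled:.2f}")
```

With these toy values every target trial falls below the threshold before scaling, and only one does afterwards, even though the ordering of the trials is identical.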
You can also calibrate a fusion of two or more systems:
$ python calibrate_scores.py --save-model sample/sys1_sys2_calibration.pth sample/trial-keys.txt sample/sys1_llr.txt sample/sys2_llr.txt
Starting point for CLLR is 0.788658
STEP: 0
loss: 0.711224191738577
loss: 0.7045015511238044
[...]
loss: 0.18383203478911536
loss: 0.18382984498508542
Converged!
Saving model to sample/sys1_sys2_calibration.pth
Apply the model:
$ python apply_calibration.py sample/sys1_sys2_calibration.pth sample/sys1_llr.txt sample/sys2_llr.txt sample/sys1_sys2_calibrated_llr.txt
Measure the performance:
$ python2 voices_scorer/score_voices sample/sys1_sys2_calibrated_llr.txt sample/trial-keys.txt
minDCF : 0.3516
actDCF : 0.3586
avgRPrec : 0.6592
EER : 0.0533
Cllr : 0.1838
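Conceptually, a linear fusion can be sketched as one weight per system plus a shared offset, applied trial by trial. Whether the scripts use exactly this parameterization is an assumption; the function name, weights, and scores below are hypothetical.

```python
def fuse(score_lists, weights, bias):
    """Combine per-system LLR lists into one fused score per trial."""
    if len(score_lists) != len(weights):
        raise ValueError("need exactly one weight per system")
    if len({len(scores) for scores in score_lists}) > 1:
        raise ValueError("all systems must score the same trials")
    # zip(*score_lists) yields, for each trial, the tuple of scores from all systems.
    return [
        sum(w * s for w, s in zip(weights, trial)) + bias
        for trial in zip(*score_lists)
    ]


# Hypothetical scores from two systems on the same two trials.
sys1 = [1.0, 2.0]
sys2 = [3.0, 4.0]
fused = fuse([sys1, sys2], weights=[0.5, 0.25], bias=1.0)
print(fused)
```

The weights and the offset would be the parameters found when optimizing CLLR on the held-out trials, so fusion and calibration happen in a single step.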
You can cite the following paper if you use the software in research:
@inproceedings{alumae2019taltech,
author={Tanel Alum\"{a}e and Asadullah},
title={The {TalTech} Systems for the {VOiCES from a Distance Challenge}},
year=2019,
booktitle={Interspeech (submitted)},
}