VERSA (Versatile Evaluation of Speech and Audio) is a toolkit that offers a collection of evaluation metrics for speech and audio quality. Our goal is to provide comprehensive access to the cutting-edge techniques developed for evaluation. The toolkit is also tightly integrated with ESPnet.
The base installation takes only a few commands:
git clone https://github.com/shinjiwlab/versa.git
cd versa
pip install .
or
pip install git+https://github.com/shinjiwlab/versa.git
VERSA is a collection of metrics: instead of redistributing the models, we try to stay as close as possible to the original API provided by each algorithm's developers. As a result, the toolkit has many dependencies. We include as many as possible by default, but some metrics have specific installation requirements. Please refer to the list-of-metrics section for details on whether a metric is automatically included. If it is not, we provide an installation guide or installers in tools.
python versa/test/test_script.py
Simple usage case for a few samples.
# direct usage
python versa/bin/scorer.py \
--score_config egs/codec_16k.yaml \
--gt test/test1 \
--pred test/test2 \
--output_file test_result
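For scripted runs, the same call can be assembled from Python. The following is a minimal sketch that only wraps the command line shown above: the helper name `run_scorer` and its `dry_run` parameter are hypothetical and not part of VERSA, and the flags are copied from the example as-is. It assumes it is run from the repository root.

```python
import subprocess
import sys


def run_scorer(score_config, gt, pred, output_file, dry_run=True):
    """Build (and optionally run) the scorer command from the example above.

    Hypothetical helper, not part of VERSA. With dry_run=True the command
    is only returned, so it can be inspected or logged first.
    """
    cmd = [
        sys.executable, "versa/bin/scorer.py",
        "--score_config", score_config,
        "--gt", gt,
        "--pred", pred,
        "--output_file", output_file,
    ]
    if not dry_run:
        # check=True raises CalledProcessError if the scorer exits non-zero.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    cmd = run_scorer("egs/codec_16k.yaml", "test/test1", "test/test2", "test_result")
    print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting issues when paths contain spaces.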
Use the launcher with Slurm job submissions.
# use the launcher
# Option1: with gt speech
./launch.sh \
<pred_speech_scp> \
<gt_speech_scp> \
<score_dir> \
<split_job_num>
# Option2: without gt speech
./launch.sh \
<pred_speech_scp> \
None \
<score_dir> \
<split_job_num>
# aggregate the results
cat <score_dir>/result/*.result.cpu.txt > <score_dir>/utt_result.cpu.txt
cat <score_dir>/result/*.result.gpu.txt > <score_dir>/utt_result.gpu.txt
# show result
python scripts/show_result.py <score_dir>/utt_result.cpu.txt
python scripts/show_result.py <score_dir>/utt_result.gpu.txt
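The aggregation step can also be done without a shell, for example on systems where `cat` with globs is inconvenient. This is a rough Python equivalent of the two `cat` commands above; the function name `aggregate_results` is hypothetical, and the file layout (`<score_dir>/result/*.result.<device>.txt`) is taken directly from those commands.

```python
import glob
import os


def aggregate_results(score_dir, device="cpu"):
    """Concatenate per-job result files into one utterance-level file.

    Hypothetical helper mirroring:
        cat <score_dir>/result/*.result.<device>.txt > <score_dir>/utt_result.<device>.txt
    Returns the output path and the number of files that were merged.
    """
    pattern = os.path.join(score_dir, "result", f"*.result.{device}.txt")
    out_path = os.path.join(score_dir, f"utt_result.{device}.txt")
    parts = sorted(glob.glob(pattern))  # sorted for a deterministic order
    with open(out_path, "w", encoding="utf-8") as out:
        for part in parts:
            with open(part, encoding="utf-8") as f:
                out.write(f.read())
    return out_path, len(parts)
```

Calling `aggregate_results(score_dir, "cpu")` and `aggregate_results(score_dir, "gpu")` produces the two files that `scripts/show_result.py` reads in the commands above.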
See egs/*.yaml for configurations for different setups.
We use [x] and [ ] to mark whether a metric is auto-installed in VERSA.
Metric Name (Auto-Install) | Key in config | Key in report | Code Source | References | Details |
---|---|---|---|---|---|
Mel Cepstral Distortion (MCD) [x] | mcd_f0 | mcd | espnet and s3prl-vc | https://ieeexplore.ieee.org/iel2/3220/9154/00407206.pdf | |
F0 Correlation [x] | mcd_f0 | f0_corr | espnet and s3prl-vc | https://ieeexplore.ieee.org/iel7/9040208/9052899/09053512.pdf | |
F0 Root Mean Square Error [x] | mcd_f0 | f0_rmse | espnet and s3prl-vc | https://ieeexplore.ieee.org/iel7/9040208/9052899/09053512.pdf | |
Signal-to-interference Ratio (SIR) [x] | signal_metric | sir | espnet | - | |
Signal-to-artifact Ratio (SAR) [x] | signal_metric | sar | espnet | - | |
Signal-to-distortion Ratio (SDR) [x] | signal_metric | sdr | espnet | - | |
Convolutional scale-invariant signal-to-distortion ratio (CI-SDR) [x] | signal_metric | ci-sdr | ci_sdr | https://arxiv.org/abs/2011.15003 | |
Scale-invariant signal-to-noise ratio (SI-SNR) [x] | signal_metric | si-snr | espnet | https://arxiv.org/abs/1711.00541 | |
Perceptual Evaluation of Speech Quality (PESQ) [x] | pesq | pesq | pesq | https://ieeexplore.ieee.org/document/941023 | |
Short-Time Objective Intelligibility (STOI) [x] | stoi | stoi | pystoi | https://ieeexplore.ieee.org/document/5495701 | |
Speech BERT Score [x] | discrete_speech | speech_bert | discrete speech metric | https://arxiv.org/abs/2401.16812 | |
Discrete Speech BLEU Score [x] | discrete_speech | speech_belu | discrete speech metric | https://arxiv.org/abs/2401.16812 | |
Discrete Speech Token Edit Distance [x] | discrete_speech | speech_token_distance | discrete speech metric | https://arxiv.org/abs/2401.16812 | |
UTokyo-SaruLab System for VoiceMOS Challenge 2022 (UTMOS) [x] | pseudo_mos | utmos | speechmos | https://arxiv.org/abs/2204.02152 | |
Deep Noise Suppression MOS Score of P.835 (DNSMOS) [x] | pseudo_mos | dnsmos_overall | speechmos (MS) | https://arxiv.org/abs/2110.01763 | |
Deep Noise Suppression MOS Score of P.808 (DNSMOS) [x] | pseudo_mos | dnsmos_p808 | speechmos (MS) | https://arxiv.org/abs/2005.08138 | |
Packet Loss Concealment-related MOS Score (PLCMOS) [x] | pseudo_mos | plcmos | speechmos (MS) | https://arxiv.org/abs/2305.15127 | |
Virtual Speech Quality Objective Listener (VISQOL) [ ] | visqol | visqol | google-visqol | https://arxiv.org/abs/2004.09584 | |
Speaker Embedding Similarity [x] | speaker | spk_similarity | espnet | https://arxiv.org/abs/2401.17230 | |
PESQ in TorchAudio-Squim [x] | squim | torch_squim_pesq | torch_squim | https://arxiv.org/abs/2304.01448 | |
STOI in TorchAudio-Squim [x] | squim | torch_squim_stoi | torch_squim | https://arxiv.org/abs/2304.01448 | |
SI-SDR in TorchAudio-Squim [x] | squim | torch_squim_si_sdr | torch_squim | https://arxiv.org/abs/2304.01448 | |
MOS in TorchAudio-Squim [x] | squim | torch_squim_mos | torch_squim | https://arxiv.org/abs/2304.01448 |
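Because one config key can produce several report keys (for example, `mcd_f0` yields `mcd`, `f0_corr`, and `f0_rmse`), it can help to read the table programmatically. The sketch below parses a few rows copied from the table above; the helper names are hypothetical and not part of VERSA, and it relies only on the first three columns and the `[x]` / `[ ]` marker.

```python
from collections import defaultdict

# A few rows copied verbatim from the table above.
ROWS = [
    "Mel Cepstral Distortion (MCD) [x] | mcd_f0 | mcd | espnet and s3prl-vc | https://ieeexplore.ieee.org/iel2/3220/9154/00407206.pdf | |",
    "F0 Correlation [x] | mcd_f0 | f0_corr | espnet and s3prl-vc | https://ieeexplore.ieee.org/iel7/9040208/9052899/09053512.pdf | |",
    "Perceptual Evaluation of Speech Quality (PESQ) [x] | pesq | pesq | pesq | https://ieeexplore.ieee.org/document/941023 | |",
    "Virtual Speech Quality Objective Listener (VISQOL) [ ] | visqol | visqol | google-visqol | https://arxiv.org/abs/2004.09584 | |",
]


def parse_metric_row(row):
    """Split one table row and read the auto-install marker from the name."""
    cells = [c.strip() for c in row.split("|")]
    name = cells[0]
    auto_install = name.endswith("[x]")
    if name.endswith("[x]") or name.endswith("[ ]"):
        name = name[:-3].strip()  # drop the 3-character marker
    return {
        "name": name,
        "auto_install": auto_install,
        "config_key": cells[1],
        "report_key": cells[2],
    }


def report_keys_by_config(rows):
    """Group report keys under the config key that produces them."""
    grouped = defaultdict(list)
    for row in rows:
        metric = parse_metric_row(row)
        grouped[metric["config_key"]].append(metric["report_key"])
    return dict(grouped)


def needs_manual_install(rows):
    """List config keys whose metric is marked [ ] (not auto-installed)."""
    return [m["config_key"] for m in map(parse_metric_row, rows) if not m["auto_install"]]


if __name__ == "__main__":
    print(report_keys_by_config(ROWS))
    print(needs_manual_install(ROWS))
```

With these four rows, only `visqol` is reported as needing a manual install, which matches its `[ ]` marker in the table.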