To install all dependencies, run:

```bash
python -m pip install pip==24.0
pip3 install -r requirements.txt
```
Word error rate (WER), speaker similarity (SIM), mel-cepstral distortion (MCD), PESQ, STOI, and ESTOI are adopted as objective evaluation metrics.
- For WER, we employ Whisper-large-v3 and Paraformer-zh as the automatic speech recognition (ASR) engines for English and Mandarin, respectively.
- For SIM, we use WavLM-large fine-tuned on the speaker verification task (model link) to extract speaker embeddings, then compute the cosine similarity between the embedding of each synthesized test utterance and that of its reference clip (see the sketch below this list).
- For MCD (mel-cepstral distortion), we use py_mcd.
- For PESQ (Perceptual Evaluation of Speech Quality), we use pesq.
- For STOI and ESTOI (Short-Time Objective Intelligibility and its extended variant), we use pystoi.
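The following is a minimal sketch of how the signal-level metrics and SIM can be computed with the pesq and pystoi packages named above. The file paths are illustrative, and `extract_embedding` is a hypothetical stand-in for the fine-tuned WavLM-large speaker-verification model this repo uses:

```python
# Minimal sketch of the objective metric computations.
# Assumes 16 kHz mono WAV files; extract_embedding() is a hypothetical
# stand-in for the fine-tuned WavLM-large speaker-verification model.
import numpy as np
import soundfile as sf
from pesq import pesq          # pip install pesq
from pystoi import stoi        # pip install pystoi

SR = 16000

def load_wav(path):
    wav, sr = sf.read(path)
    assert sr == SR, f"expected {SR} Hz, got {sr} Hz"
    return wav

ref = load_wav("ground_truth.wav")   # illustrative paths
deg = load_wav("synthesized.wav")

# PESQ and STOI expect the two signals to have the same length.
n = min(len(ref), len(deg))
ref, deg = ref[:n], deg[:n]

pesq_score  = pesq(SR, ref, deg, "wb")            # wideband PESQ
stoi_score  = stoi(ref, deg, SR, extended=False)  # STOI
estoi_score = stoi(ref, deg, SR, extended=True)   # ESTOI

# SIM: cosine similarity between speaker embeddings.
def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# sim = cosine_sim(extract_embedding("synthesized.wav"),
#                  extract_embedding("prompt.wav"))
```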
Meta-data format (one test utterance per line, fields separated by tabs):

```
ground_truth_path1'\t'prompt_path1'\t'ground_truth_text1
ground_truth_path2'\t'prompt_path2'\t'ground_truth_text2
ground_truth_path3'\t'prompt_path3'\t'ground_truth_text3
```
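For reference, a meta file in this format can be parsed as below; `read_meta` is an illustrative helper, not a function provided by this repo:

```python
# Parse the tab-separated meta file described above.
def read_meta(meta_lst_path):
    entries = []
    with open(meta_lst_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            ground_truth_path, prompt_path, ground_truth_text = line.split("\t")
            entries.append((ground_truth_path, prompt_path, ground_truth_text))
    return entries
```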
To compute WER:

```bash
bash cal_wer.sh \
    --meta_lst <path of the meta file> \
    --synthesized_dir <directory of synthesized audio> \
    --prompt_dir <directory of prompt audio> \
    --ground_truth_dir <directory of ground truth audio> \
    --lang <zh or en> \
    --generated_wav_suffix <suffix of the generated speech>
```

To compute SIM:

```bash
bash cal_sim.sh \
    --meta_lst <path of the meta file> \
    --synthesized_dir <directory of synthesized audio> \
    --prompt_dir <directory of prompt audio> \
    --ground_truth_dir <directory of ground truth audio> \
    --generated_wav_suffix <suffix of the generated speech> \
    --checkpoint <path to wavlm_large_finetune.pth>
```

To compute MCD:

```bash
bash cal_mcd.sh \
    --meta_lst <path of the meta file> \
    --synthesized_dir <directory of synthesized audio> \
    --prompt_dir <directory of prompt audio> \
    --ground_truth_dir <directory of ground truth audio> \
    --generated_wav_suffix <suffix of the generated speech>
```

To compute the reconstruction metrics:

```bash
bash cal_recons_metrics.sh \
    --meta_lst <path of the meta file> \
    --synthesized_dir <directory of synthesized audio> \
    --prompt_dir <directory of prompt audio> \
    --ground_truth_dir <directory of ground truth audio> \
    --generated_wav_suffix <suffix of the generated speech>
```
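As a concrete, purely illustrative example, a WER run over an English test set might look like the following; the paths are placeholders for your own data layout:

```bash
# Illustrative invocation; replace the paths with your own.
bash cal_wer.sh \
    --meta_lst data/test.lst \
    --synthesized_dir outputs/synthesized \
    --prompt_dir data/prompts \
    --ground_truth_dir data/ground_truth \
    --lang en \
    --generated_wav_suffix .wav
```

The other scripts take the same arguments, minus `--lang` and, for `cal_sim.sh`, plus `--checkpoint`.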
This repo is inspired by and borrows much code from Seed-TTS-Eval and UniSpeech. Thanks to the authors of these projects for their contributions!