Inference
voicegen opened this issue · 7 comments
Hello, during the inference phase, do I only need to use the 886 audio files from your data/test_audiocaps_subset.json? I have been unable to obtain the results from your paper, even when using your checkpoint.
Yes, we used those 886 audio files for evaluation. Can you specify which checkpoint you used and which results you were not able to obtain?
I used https://huggingface.co/declare-lab/tango to generate the 886 audio files with Guidance Scale = 3 and Steps = 200, and got:
{
"frechet_distance": 28.07995041974766,
"frechet_audio_distance": 2.2381015516014955,
"kullback_leibler_divergence_sigmoid": 3.8415958881378174,
"kullback_leibler_divergence_softmax": 2.097446918487549,
"lsd": 2.0631229603209094,
"psnr": 15.874651663776682,
"ssim": 0.4171875863485156,
"ssim_stft": 0.09866382013407798,
"inception_score_mean": 7.612150196882789,
"inception_score_std": 0.8235111705490618,
"kernel_inception_distance_mean": 0.010067609062191894,
"kernel_inception_distance_std": 1.404596756557554e-07
}
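For reference, generation from the Hugging Face checkpoint looks roughly like the example in the declare-lab/tango README. This is a sketch: the `Tango` wrapper class and the `steps`/`guidance` argument names follow that README, so treat the exact signature as an assumption.

```python
# Sketch of generating one sample from the Hugging Face checkpoint.
# The Tango wrapper and its generate() arguments follow the repo README;
# verify the exact signature against the version you install.
import soundfile as sf
from tango import Tango

tango = Tango("declare-lab/tango")  # downloads the checkpoint from Hugging Face
prompt = "An audience cheering and clapping"
audio = tango.generate(prompt, steps=200, guidance=3)
sf.write("sample.wav", audio, samplerate=16000)  # Tango outputs 16 kHz audio
```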
Do I need to control the length of the generated audio so that it matches the original audio length for the metrics to be comparable?
No, the length doesn't have to be controlled.
I added the inference_hf.py script for running evaluation from our huggingface checkpoints. Can you try and check the scores you obtain from this script?
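A plausible way to invoke the script, using the flag names that appear in the args block reported below (the exact argparse options are an assumption; check the script):

```python
# Hypothetical invocation of inference_hf.py; the flag names mirror the
# "args" block in the results JSON below and may not match the script exactly.
import subprocess

subprocess.run([
    "python", "inference_hf.py",
    "--test_file", "data/test_audiocaps_subset.json",
    "--num_steps", "200",
    "--guidance", "3",
    "--batch_size", "8",
], check=True)
```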
I just did two runs and got the following scores:
{
"frechet_distance": 24.4243,
"frechet_audio_distance": 1.7324,
"kl_sigmoid": 3.5901,
"kl_softmax": 1.3216,
"lsd": 2.0861,
"psnr": 15.6047,
"ssim": 0.4061,
"ssim_stft": 0.1027,
"is_mean": 7.5181,
"is_std": 0.6758,
"kid_mean": 0.0066,
"kid_std": 0.0,
"Steps": 200,
"Guidance Scale": 3,
"Test Instances": 886,
"scheduler_config": {
"num_train_timesteps": 1000,
"beta_start": 0.00085,
"beta_end": 0.012,
"beta_schedule": "scaled_linear",
"trained_betas": null,
"variance_type": "fixed_small",
"clip_sample": false,
"prediction_type": "v_prediction",
"thresholding": false,
"dynamic_thresholding_ratio": 0.995,
"clip_sample_range": 1.0,
"sample_max_value": 1.0,
"_class_name": "DDIMScheduler",
"_diffusers_version": "0.8.0",
"set_alpha_to_one": false,
"skip_prk_steps": true,
"steps_offset": 1
},
"args": {
"test_file": "data/test_audiocaps_subset.json",
"text_key": "captions",
"device": "cuda:0",
"test_references": "data/audiocaps_test_references/subset",
"num_steps": 200,
"guidance": 3,
"batch_size": 8,
"num_test_instances": -1
},
"output_dir": "outputs/1688974057_steps_200_guidance_3"
}
{
"frechet_distance": 24.9405,
"frechet_audio_distance": 1.6633,
"kl_sigmoid": 3.551,
"kl_softmax": 1.3122,
"lsd": 2.0957,
"psnr": 15.5877,
"ssim": 0.405,
"ssim_stft": 0.1027,
"is_mean": 7.187,
"is_std": 0.5192,
"kid_mean": 0.0066,
"kid_std": 0.0,
"Steps": 200,
"Guidance Scale": 3,
"Test Instances": 886,
"scheduler_config": {
"num_train_timesteps": 1000,
"beta_start": 0.00085,
"beta_end": 0.012,
"beta_schedule": "scaled_linear",
"trained_betas": null,
"variance_type": "fixed_small",
"clip_sample": false,
"prediction_type": "v_prediction",
"thresholding": false,
"dynamic_thresholding_ratio": 0.995,
"clip_sample_range": 1.0,
"sample_max_value": 1.0,
"_class_name": "DDIMScheduler",
"_diffusers_version": "0.8.0",
"set_alpha_to_one": false,
"skip_prk_steps": true,
"steps_offset": 1
},
"args": {
"test_file": "data/test_audiocaps_subset.json",
"text_key": "captions",
"device": "cuda:3",
"test_references": "data/audiocaps_test_references/subset",
"num_steps": 200,
"guidance": 3,
"batch_size": 8,
"num_test_instances": -1
},
"output_dir": "outputs/1688974524_steps_200_guidance_3"
}
The results in our paper are averages of multiple runs, as there is some randomness in the diffusion inference process.
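Since the paper numbers are averages over runs, aggregating the per-run metric JSONs could look like the minimal sketch below. The summary file names inside each output directory are hypothetical; adapt them to however the script saves its results.

```python
# Minimal sketch: average selected metrics across several evaluation runs.
# The summary.json file names are hypothetical; adapt to your outputs/ layout.
import json
from statistics import mean

run_files = [
    "outputs/1688974057_steps_200_guidance_3/summary.json",
    "outputs/1688974524_steps_200_guidance_3/summary.json",
]
runs = [json.load(open(path)) for path in run_files]

for name in ["frechet_distance", "frechet_audio_distance", "kl_softmax", "is_mean"]:
    print(name, round(mean(run[name] for run in runs), 4))
```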
Thank you for the explanation.
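As an aside, the scheduler_config reported above corresponds to a diffusers DDIMScheduler and can be reconstructed roughly as follows. This is a sketch with values copied from that JSON; newer diffusers versions may deprecate or rename some fields.

```python
# Sketch: rebuild the DDIM scheduler from the reported scheduler_config.
# Values are copied from the scheduler_config JSON above; field support
# may vary across diffusers versions.
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    clip_sample=False,
    prediction_type="v_prediction",
    set_alpha_to_one=False,
    steps_offset=1,
)
scheduler.set_timesteps(200)  # matches the 200 inference steps used here
```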
I found that the sampling rate of the reference audio has an impact on the final results. May I ask what the sampling rate of your reference audio was before converting to 16 kHz?
All our reference audio files are at 16 kHz.
I checked the AudioLDM Eval repository, and they now mention that the sampling rate can have an effect on the evaluation scores.
Their paper and evaluation code indicate that their scores are reported at 16 kHz, so we report results at the same sampling rate for a fair comparison.
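For anyone reproducing this, a minimal sketch for resampling reference audio to 16 kHz before evaluation, assuming librosa and soundfile are installed (the source directory path is hypothetical; the destination matches the test_references path in the eval args above):

```python
# Minimal sketch: resample reference wavs to 16 kHz before evaluation.
# The source directory is hypothetical; librosa resamples on load when
# a target sr is given.
import librosa
import soundfile as sf
from pathlib import Path

src_dir = Path("data/audiocaps_test_references/raw")     # hypothetical source dir
dst_dir = Path("data/audiocaps_test_references/subset")  # dir used by the eval args
dst_dir.mkdir(parents=True, exist_ok=True)

for wav_path in src_dir.glob("*.wav"):
    audio, sr = librosa.load(wav_path, sr=16000)  # load and resample to 16 kHz
    sf.write(dst_dir / wav_path.name, audio, samplerate=16000)
```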