Inference
voicegen opened this issue · 7 comments
Hello, during the inference phase, do I only need to use the 886 audio files from your data/test_audiocaps_subset.json? I have been unable to obtain the results from your paper, even when using your checkpoint.
Yes, we used those 886 audio files for evaluation. Can you specify which checkpoint you used and which results you were not able to obtain?
I used https://huggingface.co/declare-lab/tango to generate the 886 audio files with Guidance Scale = 3 and Steps = 200, and got:
{
"frechet_distance": 28.07995041974766,
"frechet_audio_distance": 2.2381015516014955,
"kullback_leibler_divergence_sigmoid": 3.8415958881378174,
"kullback_leibler_divergence_softmax": 2.097446918487549,
"lsd": 2.0631229603209094,
"psnr": 15.874651663776682,
"ssim": 0.4171875863485156,
"ssim_stft": 0.09866382013407798,
"inception_score_mean": 7.612150196882789,
"inception_score_std": 0.8235111705490618,
"kernel_inception_distance_mean": 0.010067609062191894,
"kernel_inception_distance_std": 1.404596756557554e-07
}
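For reference, generation from the Hugging Face checkpoint looks roughly like the example in the declare-lab/tango README. This is a sketch: the `Tango` wrapper class and the `steps`/`guidance` argument names follow that README, so treat the exact signature as an assumption.

```python
# Sketch of generating one sample from the Hugging Face checkpoint.
# The Tango wrapper and its generate() arguments follow the repo README;
# verify the exact signature against the version you install.
import soundfile as sf
from tango import Tango

tango = Tango("declare-lab/tango")  # downloads the checkpoint from Hugging Face
prompt = "An audience cheering and clapping"
audio = tango.generate(prompt, steps=200, guidance=3)
sf.write("sample.wav", audio, samplerate=16000)  # Tango outputs 16 kHz audio
```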
Do I need to control the length of the generated audio so that it matches the original audio length for the metrics to be comparable?
No, the length doesn't have to be controlled.
I added the inference_hf.py script for running evaluation from our huggingface checkpoints. Can you try and check the scores you obtain from this script?
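A plausible way to invoke the script, using the flag names that appear in the args block reported below (the exact argparse options are an assumption; check the script):

```python
# Hypothetical invocation of inference_hf.py; the flag names mirror the
# "args" block in the results JSON below and may not match the script exactly.
import subprocess

subprocess.run([
    "python", "inference_hf.py",
    "--test_file", "data/test_audiocaps_subset.json",
    "--num_steps", "200",
    "--guidance", "3",
    "--batch_size", "8",
], check=True)
```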
I just did two runs and got the following scores:
{
"frechet_distance": 24.4243,
"frechet_audio_distance": 1.7324,
"kl_sigmoid": 3.5901,
"kl_softmax": 1.3216,
"lsd": 2.0861,
"psnr": 15.6047,
"ssim": 0.4061,
"ssim_stft": 0.1027,
"is_mean": 7.5181,
"is_std": 0.6758,
"kid_mean": 0.0066,
"kid_std": 0.0,
"Steps": 200,
"Guidance Scale": 3,
"Test Instances": 886,
"scheduler_config": {
"num_train_timesteps": 1000,
"beta_start": 0.00085,
"beta_end": 0.012,
"beta_schedule": "scaled_linear",
"trained_betas": null,
"variance_type": "fixed_small",
"clip_sample": false,
"prediction_type": "v_prediction",
"thresholding": false,
"dynamic_thresholding_ratio": 0.995,
"clip_sample_range": 1.0,
"sample_max_value": 1.0,
"_class_name": "DDIMScheduler",
"_diffusers_version": "0.8.0",
"set_alpha_to_one": false,
"skip_prk_steps": true,
"steps_offset": 1
},
"args": {
"test_file": "data/test_audiocaps_subset.json",
"text_key": "captions",
"device": "cuda:0",
"test_references": "data/audiocaps_test_references/subset",
"num_steps": 200,
"guidance": 3,
"batch_size": 8,
"num_test_instances": -1
},
"output_dir": "outputs/1688974057_steps_200_guidance_3"
}
{
"frechet_distance": 24.9405,
"frechet_audio_distance": 1.6633,
"kl_sigmoid": 3.551,
"kl_softmax": 1.3122,
"lsd": 2.0957,
"psnr": 15.5877,
"ssim": 0.405,
"ssim_stft": 0.1027,
"is_mean": 7.187,
"is_std": 0.5192,
"kid_mean": 0.0066,
"kid_std": 0.0,
"Steps": 200,
"Guidance Scale": 3,
"Test Instances": 886,
"scheduler_config": {
"num_train_timesteps": 1000,
"beta_start": 0.00085,
"beta_end": 0.012,
"beta_schedule": "scaled_linear",
"trained_betas": null,
"variance_type": "fixed_small",
"clip_sample": false,
"prediction_type": "v_prediction",
"thresholding": false,
"dynamic_thresholding_ratio": 0.995,
"clip_sample_range": 1.0,
"sample_max_value": 1.0,
"_class_name": "DDIMScheduler",
"_diffusers_version": "0.8.0",
"set_alpha_to_one": false,
"skip_prk_steps": true,
"steps_offset": 1
},
"args": {
"test_file": "data/test_audiocaps_subset.json",
"text_key": "captions",
"device": "cuda:3",
"test_references": "data/audiocaps_test_references/subset",
"num_steps": 200,
"guidance": 3,
"batch_size": 8,
"num_test_instances": -1
},
"output_dir": "outputs/1688974524_steps_200_guidance_3"
}
The results in our paper are averages of multiple runs, as there is some randomness in the diffusion inference process.
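Since the paper numbers are averages over runs, aggregating the per-run metric JSONs could look like the minimal sketch below. The summary file names inside each output directory are hypothetical; adapt them to however the script saves its results.

```python
# Minimal sketch: average selected metrics across several evaluation runs.
# The summary.json file names are hypothetical; adapt to your outputs/ layout.
import json
from statistics import mean

run_files = [
    "outputs/1688974057_steps_200_guidance_3/summary.json",
    "outputs/1688974524_steps_200_guidance_3/summary.json",
]
runs = [json.load(open(path)) for path in run_files]

for name in ["frechet_distance", "frechet_audio_distance", "kl_softmax", "is_mean"]:
    print(name, round(mean(run[name] for run in runs), 4))
```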
Thank you for the explanation.
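As an aside, the scheduler_config reported above corresponds to a diffusers DDIMScheduler and can be reconstructed roughly as follows. This is a sketch with values copied from that JSON; newer diffusers versions may deprecate or rename some fields.

```python
# Sketch: rebuild the DDIM scheduler from the reported scheduler_config.
# Values are copied from the scheduler_config JSON above; field support
# may vary across diffusers versions.
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    clip_sample=False,
    prediction_type="v_prediction",
    set_alpha_to_one=False,
    steps_offset=1,
)
scheduler.set_timesteps(200)  # matches the 200 inference steps used here
```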
I found that the sampling rate of the reference audio has an impact on the final results. May I ask what the sampling rate of your reference audio was before converting to 16 kHz?
All our reference audio files are at 16 kHz.
I checked the AudioLDM Eval repository, and they now mention that the sampling rate can have an effect on the evaluation scores.
Their paper and evaluation code indicate that their scores are reported at 16 kHz, so we report results at the same sampling rate for a fair comparison.
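For anyone reproducing this, a minimal sketch for resampling reference audio to 16 kHz before evaluation, assuming librosa and soundfile are installed (the source directory path is hypothetical; the destination matches the test_references path in the eval args above):

```python
# Minimal sketch: resample reference wavs to 16 kHz before evaluation.
# The source directory is hypothetical; librosa resamples on load when
# a target sr is given.
import librosa
import soundfile as sf
from pathlib import Path

src_dir = Path("data/audiocaps_test_references/raw")     # hypothetical source dir
dst_dir = Path("data/audiocaps_test_references/subset")  # dir used by the eval args
dst_dir.mkdir(parents=True, exist_ok=True)

for wav_path in src_dir.glob("*.wav"):
    audio, sr = librosa.load(wav_path, sr=16000)  # load and resample to 16 kHz
    sf.write(dst_dir / wav_path.name, audio, samplerate=16000)
```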