marcoppasini/MelGAN-VC

Would you please explain the inference process and what GRAD does?

Closed this issue · 3 comments

Hi, I read the code, but I am confused about two points in the inference part:

  1. Only one wav file is fed to the inference process. I expected two input wav files: one for the source and one for the target. Would you please explain this?
  2. GRAD is really hard to understand. Could you please give me some guidance?

Looking forward to your reply. Thank you!

Hi!

  1. The model is trained with only one target style domain (a voice or a music genre), so at inference you don't need to feed a target sample to the model. The single input wav is the source, and the target style is already learned by the model.
  2. The GRAD function is a gradient-based method to turn spectrograms back into waveforms, which I find works better than the traditional Griffin-Lim algorithm. You can read the cited paper for more specific info.
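
To make the idea of gradient-based inversion more concrete, here is a minimal sketch. It is not the repository's GRAD code. It assumes a plain linear-frequency magnitude spectrogram (not a mel spectrogram), a Hann window, a hand-derived gradient, and a fixed step size, and all function names are made up for illustration. The waveform itself is treated as the unknown and is updated by gradient descent until its magnitude spectrogram matches the target.

```python
import numpy as np

def frame_indices(n_samples, n_fft, hop):
    """Index matrix that slices a signal into overlapping frames (one per row)."""
    n_frames = 1 + (n_samples - n_fft) // hop
    return hop * np.arange(n_frames)[:, None] + np.arange(n_fft)[None, :]

def magnitude_and_phase(x, idx, window, eps=1e-8):
    """Windowed full FFT of every frame, split into magnitude and unit phase."""
    spec = np.fft.fft(x[idx] * window, axis=1)
    mag = np.abs(spec)
    return mag, spec / np.maximum(mag, eps)

def invert_spectrogram(target_mag, n_samples, n_fft, hop, n_iter=200, seed=0):
    """Gradient descent on loss(x) = sum((|STFT(x)| - target_mag) ** 2)."""
    window = np.hanning(n_fft)
    idx = frame_indices(n_samples, n_fft, hop)

    # Conservative step size (an assumption of this sketch): the overlap-added
    # squared window bounds how strongly a single sample affects the loss.
    overlap = np.zeros(n_samples)
    np.add.at(overlap, idx, window ** 2)
    step = 1.0 / (2.0 * n_fft * overlap.max())

    rng = np.random.default_rng(seed)
    x = 0.01 * rng.standard_normal(n_samples)  # random initial waveform
    losses = []
    for _ in range(n_iter):
        mag, phase = magnitude_and_phase(x, idx, window)
        diff = mag - target_mag
        losses.append(float(np.sum(diff ** 2)))
        # Gradient with respect to each windowed frame ...
        frame_grad = 2.0 * n_fft * window * np.fft.ifft(diff * phase, axis=1).real
        # ... overlap-added back onto the waveform samples.
        grad = np.zeros(n_samples)
        np.add.at(grad, idx, frame_grad)
        x -= step * grad
    return x, losses

# Toy usage: recover a two-tone signal from its magnitude spectrogram only.
n_fft, hop = 64, 16
n_samples = n_fft + hop * 60
t = np.arange(n_samples)
reference = np.sin(2 * np.pi * 0.05 * t) + 0.5 * np.sin(2 * np.pi * 0.12 * t)
idx = frame_indices(n_samples, n_fft, hop)
target_mag, _ = magnitude_and_phase(reference, idx, np.hanning(n_fft))

estimate, losses = invert_spectrogram(target_mag, n_samples, n_fft, hop)
print(f"loss before: {losses[0]:.3f}, loss after: {losses[-1]:.3f}")
```

In practice an autodiff framework with an optimizer such as Adam would usually replace the hand-written gradient, and the target would be whatever spectrogram transform the model was trained on. The structure stays the same: compute the spectrogram of the current waveform, compare it with the target, and step the waveform along the negative gradient.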

Hi!
Have you tried a WaveNet vocoder for turning spectrograms back into waveforms?

@CarolinGao Hi, I haven't followed this repo any further since opening this issue, although MelGAN-VC works well for one-to-one VC. I need any-to-many or any-to-any VC, which is more widely applicable.