seungwonpark/melgan

How to run inference with MelGAN given a Tacotron mel spectrogram output?

OswaldoBornemann opened this issue · 11 comments

When I trained MelGAN on mel spectrograms computed from the original wavs, the results were good.

But when I feed the Tacotron mel spectrogram output into the trained MelGAN model, the sound is just a beep. Would you mind sharing some advice? Thanks a lot. @seungwonpark

Can you upload sound samples?

@CookiePPP Please turn the volume down to its lowest setting... I don't want to hurt your ears...

bad result.wav.zip

Do you have the code you used to feed the tacotron outputs into melgan uploaded somewhere?
That's definitely bugged out.

@CookiePPP The process is roughly as follows:

First I get the mel spectrogram output from Tacotron, like this:

# mel_sent shape is (spec_length, 80)
mel_sent = tacotron_out(model, sentence, CONFIG, use_cuda, ap, use_gl=use_gl, figures=True)

Then I unsqueeze and transpose the mel result to feed it into MelGAN.

checkpoint_path = "./melgan/chkpt/id_test1/id_test1_aca5990_0700.pt"
config = "./melgan/config/id_test1.yaml"

checkpoint = torch.load(checkpoint_path)
# if args.config is not None:
#     hp = HParam(config)
# else:
hp = load_hparam_str(checkpoint['hp_str'])

melgan_model = Generator(hp.audio.n_mel_channels).cuda()
melgan_model.load_state_dict(checkpoint['model_g'])
melgan_model.eval()

with torch.no_grad():
    mel = torch.from_numpy(mel_sent).unsqueeze(0).transpose(2, 1)
    mel = mel.cuda()

    audio = melgan_model.inference(mel)  # call the MelGAN generator, not the Tacotron `model`
    audio = audio.cpu().detach().numpy()

mel_sent = tacotron_out(model, sentence, CONFIG, use_cuda, ap, use_gl=use_gl, figures=True)

Where does this line come from? This repo is designed to interface with NVIDIA/Tacotron.
Nvidia uses their own spectrogram conversion that, I believe, outputs values between -12 and 2.
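One way to check for the range mismatch described above is to print the value range of the spectrogram right before it goes into MelGAN. This is only a minimal sketch: `mel_range` is a hypothetical helper, and `fake_mel` is random stand-in data with the `(spec_length, 80)` shape mentioned earlier in the thread, not real Tacotron output.

```python
import numpy as np

def mel_range(mel):
    """Return (min, max, mean) of a mel spectrogram given as an array."""
    mel = np.asarray(mel, dtype=np.float32)
    return float(mel.min()), float(mel.max()), float(mel.mean())

# Stand-in data with shape (spec_length, 80); replace with the real
# synthesizer output (e.g. mel_sent) when running this for real.
fake_mel = np.random.default_rng(0).uniform(0.0, 1.0, size=(120, 80))

lo, hi, mean = mel_range(fake_mel)
print(f"min={lo:.3f} max={hi:.3f} mean={mean:.3f}")
```

Running the same helper on a spectrogram from the MelGAN training set and on the synthesizer output should show whether the two live in different ranges.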

@CookiePPP I see. I use Mozilla TTS instead.

@CookiePPP I would like to know whether we could use Tacotron GTA (ground-truth-aligned) output to train MelGAN.

@tsungruihon
You should be able to scale the output and get an audible result. I don't know what range Mozilla TTS has, but try to transform the Mozilla output to match the Nvidia one.
e.g.

mel_sent = tacotron_out(model, sentence, CONFIG, use_cuda, ap, use_gl=use_gl, figures=True)
mel_sent = (mel_sent * 0.5) + 2

and replace 0.5 and +2 with the values that move the spectrogram between -12 and 2.
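The scale and offset can be derived from the measured source range instead of guessed. The sketch below assumes a plain linear mapping is enough, which only holds if both spectrograms use the same kind of log scale. `rescale_mel` is a hypothetical helper, the source range of 0 to 1 is a made-up example, and the target range of -12 to 2 is the value suggested in this thread, not a verified constant.

```python
import numpy as np

def rescale_mel(mel, src_min, src_max, dst_min=-12.0, dst_max=2.0):
    """Linearly map values from [src_min, src_max] to [dst_min, dst_max].

    Returns the rescaled array together with the scale and offset used,
    so that rescaled == mel * scale + offset.
    """
    mel = np.asarray(mel, dtype=np.float32)
    scale = (dst_max - dst_min) / (src_max - src_min)
    offset = dst_min - src_min * scale
    return mel * scale + offset, scale, offset

# Made-up example: a spectrogram whose values span 0..1.
example = np.array([[0.0, 0.5, 1.0]], dtype=np.float32)
rescaled, scale, offset = rescale_mel(example, src_min=0.0, src_max=1.0)
print(scale, offset)
print(rescaled)
```

With these example numbers the two constants in `mel_sent = (mel_sent * 0.5) + 2` would become 14 and -12. The source range should be measured over many utterances rather than taken from a single one.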

@CookiePPP I would like to know whether we could use Tacotron GTA (ground-truth-aligned) output to train MelGAN.

Not sure. I'm busy today, so I can't really help you there.

@CookiePPP Really appreciated. Thanks a lot.

I'm facing the same problem. Did you find a solution?
@tsungruihon