How to run inference with MelGAN given a Tacotron mel spectrogram output?

Question

OswaldoBornemann opened this issue 5 years ago · 11 comments

When I trained MelGAN on mel spectrograms computed from the original wavs, the results were good.

But when I fed the Tacotron mel spectrogram output into the trained MelGAN model, the audio was nothing but a beep-like noise. Would you mind sharing some advice? Thanks a lot. @seungwonpark

Answer 1 · 2020-03-09T09:25:06.000Z

Could you upload sound samples?

Answer 2 · 2020-03-09T09:44:03.000Z

@CookiePPP Please set the volume to the lowest level first... I don't want to hurt your ears...

bad result.wav.zip

Answer 3 · 2020-03-09T09:46:14.000Z

Do you have the code you used to feed the tacotron outputs into melgan uploaded somewhere?
That's definitely bugged out.

Answer 4 · 2020-03-09T09:51:47.000Z

@CookiePPP The process is roughly as follows:

First I get the mel spectrogram output from Tacotron like this:

# mel sent shape is (spec_length, 80)
mel_sent = tacotron_out(model, sentence, CONFIG, use_cuda, ap, use_gl=use_gl, figures=True)

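Then I unsqueeze and transpose the mel result to feed it into MelGAN.

import torch

# Generator and load_hparam_str come from the MelGAN repo
# (model/generator.py and utils/hparams.py); adjust the import path to your layout.
from model.generator import Generator
from utils.hparams import load_hparam_str

checkpoint_path = "./melgan/chkpt/id_test1/id_test1_aca5990_0700.pt"
config = "./melgan/config/id_test1.yaml"

checkpoint = torch.load(checkpoint_path)
# if args.config is not None:
#     hp = HParam(config)
# else:
hp = load_hparam_str(checkpoint['hp_str'])

melgan_model = Generator(hp.audio.n_mel_channels).cuda()
melgan_model.load_state_dict(checkpoint['model_g'])
melgan_model.eval()

with torch.no_grad():
    # (spec_length, 80) -> (1, 80, spec_length)
    mel = torch.from_numpy(mel_sent).unsqueeze(0).transpose(2, 1)
    mel = mel.cuda()

    # Call the MelGAN generator here, not `model` (the Tacotron model above).
    audio = melgan_model.inference(mel)
    audio = audio.cpu().detach().numpy()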

Answer 5 · 2020-03-09T09:55:29.000Z

mel_sent = tacotron_out(model, sentence, CONFIG, use_cuda, ap, use_gl=use_gl, figures=True)

Where does this line come from? This repo is designed to interface with NVIDIA/Tacotron.
NVIDIA uses its own spectrogram conversion, which I believe outputs values between -12 and 2.
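
A quick way to see whether a range mismatch like this is the problem is to print the minimum and maximum of the mel spectrogram before it reaches the vocoder. The following is a minimal sketch, not part of either repo: the helper name `mel_range_report` is hypothetical, and the default bounds of -12 and 2 are simply the estimate given above, not verified values.

```python
import numpy as np

# Assumed target range, taken from the estimate in this thread (not verified).
EXPECTED_MIN, EXPECTED_MAX = -12.0, 2.0


def mel_range_report(mel, lo=EXPECTED_MIN, hi=EXPECTED_MAX):
    """Return (min, max, fits) for a mel spectrogram array.

    `fits` is True only if every value lies inside [lo, hi].
    """
    mel = np.asarray(mel, dtype=np.float32)
    m_min, m_max = float(mel.min()), float(mel.max())
    fits = (m_min >= lo) and (m_max <= hi)
    return m_min, m_max, fits


if __name__ == "__main__":
    # Toy array standing in for a (spec_length, n_mels) Tacotron output.
    toy = np.array([[0.0, 4.0], [-4.0, 1.0]])
    print(mel_range_report(toy))
```

Running the same check on a mel spectrogram computed from a training wav and on the Tacotron output makes any difference in scale visible immediately.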

Answer 6 · 2020-03-09T09:59:04.000Z

@CookiePPP I see. I use Mozilla TTS instead.

Answer 7 · 2020-03-09T10:00:25.000Z

@CookiePPP I would also like to know whether we could use Tacotron GTA (ground-truth-aligned) output to train MelGAN.

Answer 8 · 2020-03-09T10:02:31.000Z

@tsungruihon
You should be able to scale the output and get an audible result. I don't know what range Mozilla TTS uses, but try transforming the Mozilla output to match the NVIDIA one,
e.g.

mel_sent = tacotron_out(model, sentence, CONFIG, use_cuda, ap, use_gl=use_gl, figures=True)
mel_sent = (mel_sent * 0.5) + 2

and replace 0.5 and +2 with the values that map the spectrogram into the range -12 to 2.
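
The scale and offset can be derived from the two ranges instead of guessed. The sketch below assumes the source range is known; the example values of -4 and 4 are only an assumption about how the Mozilla TTS output might be normalized, so check your own config or measure the actual min and max. The function name `linear_rescale` is hypothetical. A linear map like this can only help if the two spectrograms differ by an affine transform of a log-magnitude scale; it cannot compensate for different STFT settings or mel filterbanks.

```python
import numpy as np


def linear_rescale(mel, src_min, src_max, dst_min=-12.0, dst_max=2.0):
    """Linearly map values from [src_min, src_max] to [dst_min, dst_max].

    Returns (rescaled_mel, scale, offset) so that
    rescaled_mel == mel * scale + offset.
    The default destination range is the estimate from this thread.
    """
    if src_max <= src_min:
        raise ValueError("src_max must be greater than src_min")
    mel = np.asarray(mel, dtype=np.float32)
    scale = (dst_max - dst_min) / (src_max - src_min)
    offset = dst_min - src_min * scale
    return mel * scale + offset, scale, offset


if __name__ == "__main__":
    # Toy values; the [-4, 4] source range is an assumption for illustration.
    toy = np.array([-4.0, 0.0, 4.0])
    rescaled, scale, offset = linear_rescale(toy, src_min=-4.0, src_max=4.0)
    print(scale, offset, rescaled)
```

With these example ranges the line above would become `mel_sent = (mel_sent * 1.75) - 5`.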

"@CookiePPP I would also like to know whether we could use Tacotron GTA output to train MelGAN."

Not sure. I'm busy today, so I can't really help you there.

Answer 9 · 2020-03-09T10:05:43.000Z

@CookiePPP Really appreciated. Thanks a lot.

Answer 10 · 2020-11-15T13:29:44.000Z

I am facing the same problem. Did you find a solution?
@tsungruihon

Answer 11 · 2020-11-16T00:33:54.000Z

Please visit https://github.com/mozilla/TTS