soobinseo/Transformer-TTS

about Mel_Post_Net

Opened this issue · 1 comment

Many thanks for your great work.

I have a question about the Post-Net after the mel linear layer. In the inference stage, you use mel_pred as the output rather than postnet_pred, as shown below (in synthesis.py):

with t.no_grad():
    for i in pbar:
        pos_mel = t.arange(1,mel_input.size(1)+1).unsqueeze(0).cuda()
        mel_pred, postnet_pred, attn, stop_token, _, attn_dec = m.forward(text, mel_input, pos_text, pos_mel)
        # mel_pred (not postnet_pred) is appended and fed back as the decoder input
        mel_input = t.cat([mel_input, mel_pred[:,-1:,:]], dim=1)

Also, when I run the code, I find that post_mel_loss is always larger than mel_loss, which suggests the Post-Net module doesn't work as expected, right? This seems to conflict with the Post-Net used in Tacotron and in the original Transformer-TTS paper. I am a bit confused; could you explain it to me? Many thanks!
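To make the comparison concrete, here is a minimal sketch of how the two losses are usually computed in Tacotron-style models (shapes and the post-net stand-in are made up, this is not this repo's exact code):

    import torch
    import torch.nn as nn

    # toy shapes: (batch, time, n_mels); the real values come from the dataloader
    mel_target = torch.randn(2, 100, 80)
    mel_pred = torch.randn(2, 100, 80)    # decoder output before the post-net
    postnet = nn.Linear(80, 80)           # hypothetical stand-in for the real post-net

    l1 = nn.L1Loss()

    # loss on the raw decoder output
    mel_loss = l1(mel_pred, mel_target)

    # in Tacotron-style models the post-net predicts a residual that refines
    # the decoder output, so postnet_pred is expected to fit the target at least as well
    postnet_pred = mel_pred + postnet(mel_pred)
    post_mel_loss = l1(postnet_pred, mel_target)

    loss = mel_loss + post_mel_loss

Under that residual formulation, I would expect post_mel_loss to end up at or below mel_loss after training, which is why the observed behavior confuses me.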

Thanks for the great work.
I have the same question as @liuxubo717. Comparing with Tacotron 2, I find that its post-net is built from torch.nn.Conv1d layers, while your post-net uses a Bi-GRU. Does the Bi-GRU perform better than Tacotron 2's post-net? Could you explain? Thanks. (A rough sketch of the two post-net flavours I mean is shown below.)