Integration with Tacotron
twidddj opened this issue · 4 comments
So far, I couldn't find a model whose attention works with reduction factor = 1. If we use a factor > 1, the predictions look like the image below. That would be bad news for WaveNet performance.
For comparison, the original mel spectrogram is:
You can also find some discussion of this issue on @Rayhane-mamah's and @keithito's repos.
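For context, the reduction factor r is the number of mel frames the Tacotron decoder predicts per decoder step. The sketch below is illustrative only (the function name and shapes are not taken from either repo): the stacked decoder outputs are reshaped back to one mel frame per time step, and with r > 1 the decoder runs fewer steps, which makes attention easier to learn but tends to yield coarser, noisier mels.

```python
# Illustrative sketch (not the repos' actual code) of the reduction factor r:
# the decoder emits r mel frames per decoder step, and the stacked outputs are
# reshaped back to one frame per time step.
import numpy as np

def decoder_outputs_to_mel(decoder_outputs, num_mels, r):
    """(batch, decoder_steps, num_mels * r) -> (batch, decoder_steps * r, num_mels)."""
    batch, steps, _ = decoder_outputs.shape
    return decoder_outputs.reshape(batch, steps * r, num_mels)

# Example: 100 decoder steps with r = 5 cover 500 mel frames.
outputs = np.zeros((1, 100, 80 * 5))
mel = decoder_outputs_to_mel(outputs, num_mels=80, r=5)
assert mel.shape == (1, 500, 80)
```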
Hi @twidddj, thanks for sharing your work!
I am assuming you trained the WaveNet vocoder on ground-truth mels? Did you try training it on ground-truth-aligned (GTA) samples generated with the Tacotron (r > 1) model?
In the best case, the WaveNet will learn to map the mels correctly despite the noise in them. If that doesn't work, we'll try to figure out why attention isn't working with r = 1. (I haven't tested it yet; I get my GPU this week, so I'll let you know how it goes.)
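For readers unfamiliar with GTA: the Tacotron is run in teacher-forcing mode over the training set, so the predicted mels stay time-aligned with the ground-truth audio, and WaveNet is then trained on those predictions instead of the ground-truth mels. A minimal sketch of that loop follows; `run_tacotron_teacher_forced` is a stand-in stub and the file layout is purely illustrative, not the actual code of either repo.

```python
# Hypothetical sketch of GTA (ground-truth-aligned) synthesis: run Tacotron
# with teacher forcing on the training set and save the predicted mels so the
# WaveNet vocoder can be conditioned on the same kind of (noisy) mels it will
# see at synthesis time.
import os
import numpy as np

def run_tacotron_teacher_forced(text_ids, gt_mel):
    """Stand-in for Tacotron inference with teacher forcing.

    The decoder is fed the ground-truth mel frames at every step, so the
    output has the same number of frames as `gt_mel` and stays time-aligned
    with the ground-truth audio.
    """
    # A real implementation would call the trained model here; this stub just
    # returns the ground-truth mel with a little noise to mimic prediction error.
    return gt_mel + 0.01 * np.random.randn(*gt_mel.shape)

def generate_gta_mels(metadata, mel_dir, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    for utt_id, text_ids in metadata:
        gt_mel = np.load(os.path.join(mel_dir, f"{utt_id}.npy"))    # (T, n_mels)
        gta_mel = run_tacotron_teacher_forced(text_ids, gt_mel)     # (T, n_mels)
        # WaveNet is then trained on (audio[utt_id], gta_mel) pairs.
        np.save(os.path.join(out_dir, f"{utt_id}-gta.npy"), gta_mel)
```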
Hi @Rayhane-mamah, welcome!
Yes, you are right. It was trained on ground-truth mels, not GTA. I haven't tested GTA yet, but I plan to try it next week using @keithito's pretrained model (r=5). If you let me know how it goes on your side, that would be very helpful. Hopefully we can make some progress on this together.
We have tried a few things for this issue:
- Tested @Rayhane-mamah's Tacotron-2 with r=1. Its attention works, and the intelligibility of the TTS is remarkably improved compared to the previous version. However, there is another issue he has reported; we believe that problem will be solved soon. Thanks!
- Tested our vocoder on mel spectrograms computed with the same settings as the Tacotron 2 paper (fft_size 2048, hop_size 300, window_size 1300, at a 24 kHz sample rate; see the sketch below). Although it seems to require more training steps (over 1000K) than r9y9's setting, it works too. Thanks to @Ondal90!
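For anyone reproducing that setup, here is a minimal sketch of mel extraction with those STFT settings using librosa. The number of mel bins, the dB floor, and the normalization are assumptions for illustration, not the exact values used in the test above.

```python
# Hedged sketch of mel-spectrogram extraction with the settings mentioned
# above (24 kHz audio, fft_size 2048, hop_size 300, window_size 1300).
# NUM_MELS, MIN_LEVEL_DB, and the [0, 1] normalization are assumptions.
import librosa
import numpy as np

SAMPLE_RATE = 24000
FFT_SIZE = 2048
HOP_SIZE = 300
WIN_SIZE = 1300
NUM_MELS = 80          # assumption: 80 mel bins, as in the Tacotron 2 paper
MIN_LEVEL_DB = -100    # assumption

def melspectrogram(wav_path):
    wav, _ = librosa.load(wav_path, sr=SAMPLE_RATE)
    # Linear-frequency STFT magnitude.
    spec = np.abs(librosa.stft(wav, n_fft=FFT_SIZE,
                               hop_length=HOP_SIZE, win_length=WIN_SIZE))
    # Project onto a mel filterbank and convert to dB.
    mel_basis = librosa.filters.mel(sr=SAMPLE_RATE, n_fft=FFT_SIZE, n_mels=NUM_MELS)
    mel_db = 20.0 * np.log10(np.maximum(1e-5, np.dot(mel_basis, spec)))
    # Normalize to [0, 1] (assumed normalization scheme).
    return np.clip((mel_db - MIN_LEVEL_DB) / -MIN_LEVEL_DB, 0.0, 1.0)
```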