twidddj/tf-wavenet_vocoder

Integration with Tacotron

twidddj opened this issue · 4 comments

So far, I couldn't find a model whose attention works with "reduction factor" = 1. If we use a factor > 1, the prediction looks like the image below, which would be bad news for WaveNet performance.
[Image: teacher_forced_mel_prediction]

Here, the original mel spectrogram is:
[Image: true_mel]

You can also find some discussion of this issue in @Rayhane-mamah's and @keithito's repos.
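
For reference, here is a minimal sketch of what the reduction factor means, assuming the usual Tacotron convention that the decoder packs r mel frames into each step (illustrative NumPy only, not code from either repo):

```python
import numpy as np

# Illustrative only: how a reduction factor r coarsens the decoder time axis.
r = 5                 # reduction factor; r = 1 means one mel frame per decoder step
num_mels = 80
decoder_steps = 100

# Decoder output with r frames packed into each step: (steps, r * num_mels)
decoder_out = np.random.randn(decoder_steps, r * num_mels)

# Unpack to one frame per time index: (steps * r, num_mels)
mel = decoder_out.reshape(decoder_steps * r, num_mels)
print(mel.shape)  # (500, 80)
```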

Hi @twidddj, thanks for sharing your work!

I am assuming you trained the WaveNet vocoder on ground-truth mels? Did you try training it on ground-truth-aligned (GTA) samples generated with the Tacotron (r > 1) model?
In the best case, the WaveNet will learn to map the mels correctly despite the noise in them. If that doesn't work, we'll try to figure out why attention isn't working with r = 1. (I haven't tested it yet; I get my GPU this week, so I'll tell you how it goes.)
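
For anyone following along, this is a rough sketch of the GTA idea, with stand-in stubs for the two models (none of this is the actual API of either repo): the Tacotron is run in teacher-forcing mode on the training data, and the vocoder is then trained on those time-aligned but noisy predictions instead of the ground-truth mels.

```python
import numpy as np

def tacotron_teacher_forced(text, mel_target):
    """Stub: a teacher-forced Tacotron pass returns a prediction that is
    time-aligned with mel_target but carries the model's own noise."""
    return mel_target + 0.1 * np.random.randn(*mel_target.shape)

def wavenet_train_step(condition_mel, target_audio):
    """Stub: one vocoder training step conditioned on mel features."""
    pass

# Ground-truth pair for one utterance (shapes are arbitrary examples).
mel_target = np.random.randn(500, 80)     # (frames, mel bins)
audio = np.random.randn(500 * 300)        # waveform, assuming hop_size = 300

# Train the vocoder on the GTA mel so it sees the same (noisy) mel
# distribution during training that it will see at synthesis time.
gta_mel = tacotron_teacher_forced("some text", mel_target)
wavenet_train_step(gta_mel, audio)
```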

Hi @Rayhane-mamah, welcome!

Yes, you are right. It was trained on ground-truth mels, not GTA. I haven't tested GTA yet, but I plan to do it next week using @keithito's pretrained model (r=5). If you tell me how it goes on your end, that would be very helpful. Hopefully we can both make progress on this.

We have tried a few things for this issue:

  • Tested Rayhane-mamah's Tacotron-2 with r=1. Its attention works, and the intelligibility of the TTS is remarkably improved compared to the previous version. However, there is another issue he has reported. We believe that problem will be solved soon. Thanks!
  • Tested our vocoder on mel spectrograms computed with the same method as the Tacotron 2 paper (2048 fft_size, 300 hop_size, 1300 window_size at a 24K sample rate; see the sketch below). Although it seems to require more training steps (over 1000K) than r9y9's setting, it works too. Thanks to @Ondal90!
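
A rough sketch of the mel extraction with those parameters, using librosa (the exact scaling and normalization in either repo may differ, and the synthetic sine below just stands in for a real waveform):

```python
import numpy as np
import librosa

# Parameters as stated above: 2048 fft_size, 300 hop_size, 1300 window_size, 24 kHz.
sr = 24000
y = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s of 440 Hz as a stand-in waveform

mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=2048, hop_length=300, win_length=1300,
    n_mels=80)
log_mel = np.log(np.maximum(mel, 1e-5))   # log compression; the floor value is an assumption
print(log_mel.shape)                      # (80, n_frames)
```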