G-Wang/WaveRNN-Pytorch

Loss curve

chaiyujin opened this issue · 10 comments

Hi, I am currently training a WaveRNN, but it's hard to train. Have you ever trained a good model? Would you mind sharing your training loss curve?

Hi, which version of WaveRNN are you using? I trained fatchord's original WaveRNN and it had difficulty converging as well; however, with his modified version the preliminary results are better.

I am porting his code from the Jupyter notebook and will add training curves once training on LJSpeech is done. I'm also trying to see if I can get a Gaussian output working (from ClariNet).
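
For context, by Gaussian output I mean the single-Gaussian likelihood from ClariNet: the network predicts a mean and log standard deviation per sample instead of softmax logits, and the loss is the negative log-likelihood. A rough sketch of what I have in mind (names and shapes are mine, not the repo's code):

```python
import math
import torch

def gaussian_nll(y, mean, log_std, log_std_min=-7.0):
    # y, mean, log_std: tensors of the same shape; y is the real-valued target in [-1, 1]
    log_std = torch.clamp(log_std, min=log_std_min)  # keep the predicted scale from collapsing
    nll = 0.5 * (y - mean) ** 2 * torch.exp(-2.0 * log_std) + log_std + 0.5 * math.log(2.0 * math.pi)
    return nll.mean()
```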

Inspired by fatchord's modified version, I use two GRUs to train the model described in the paper (dual softmax). It runs much faster than fatchord's original version but should be equivalent (I think they are equal).
My local conditioning is the upsampled mel only, because I think the 1D residual module does much the same job as the upsample module.
Currently, my cross-entropy training loss is ~2.7, but the audio produced by incremental inference is not good.
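
To be concrete, by dual softmax I mean splitting each 16-bit sample into an 8-bit coarse and an 8-bit fine part and training a softmax over each, summing the two cross-entropy terms. A rough sketch, with my own simplified naming rather than my actual code:

```python
import torch.nn.functional as F

def split_coarse_fine(wav, bits=16):
    # wav: integer samples in [0, 2**bits); returns coarse (high bits) and fine (low bits),
    # e.g. bits=16 gives an 8-bit coarse target and an 8-bit fine target
    half = 2 ** (bits // 2)
    return wav // half, wav % half

def dual_softmax_loss(coarse_logits, fine_logits, coarse_target, fine_target):
    # logits: (batch, time, 2**(bits//2)); targets: (batch, time) integer class indices
    loss_c = F.cross_entropy(coarse_logits.transpose(1, 2), coarse_target)
    loss_f = F.cross_entropy(fine_logits.transpose(1, 2), fine_target)
    return loss_c + loss_f
```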

Once you get a good result with LJSpeech, please tell me. I also use LJSpeech to train my model.
Thank you.

OK, I will update you once the model has trained well. Would you mind sharing some audio samples from your WaveRNN vocoder for reference?

My model failed to infer good audio; I think there are some bugs in my code. I'm curious about the Gaussian output you mentioned before. Is that similar to the mixture-of-logistics loss used in WaveNet?

By the way, have you ever tried FFTNet? Is it as fast and as good as claimed in the paper? I want to find something that is faster to train and still reasonably good. WaveNet is really slow to train (about 1 week).

I have trained a WaveRNN with coarse/fine bits [4, 4]. The result is not that good, but it only takes one night to get a trained model.

lj001-0001

Hello, sorry for the late response, I was on vacation.

I've trained a model for about 20 hours with fatchord's WaveRNN. Below are the plots for ground truth versus predicted, as well as audio samples.

Note the audio is 9-bit.

Ground truth:
[gt plot]

Predicted, conditioned on mel (fft_size=1024, hop_size=256, win_length=1024):
[pred plot]

Audio samples:

audio_samples.tar.gz

I think this can be trained further; there is some minor static in the audio. You can hear the static in the ground-truth audio as well, since it has been discretized to 9 bits. Training a higher bit-depth model will likely eliminate it, or it could be removed with post-processing.
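
For what it's worth, that kind of static is easy to reproduce without any model, just by quantizing and de-quantizing a waveform to 9 bits. A minimal sketch using plain linear quantization (the repo itself may use a different scheme, e.g. mu-law):

```python
import numpy as np

def quantize(wav, bits=9):
    # wav: float waveform in [-1, 1] -> integer levels in [0, 2**bits)
    levels = 2 ** bits
    q = np.clip((wav + 1.0) * 0.5 * (levels - 1), 0, levels - 1)
    return np.round(q).astype(np.int64)

def dequantize(q, bits=9):
    levels = 2 ** bits
    return 2.0 * q.astype(np.float32) / (levels - 1) - 1.0

# dequantize(quantize(wav)) carries roughly the same background hiss
# you hear in the 9-bit ground-truth sample above
```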

Generation on my computer (GTX 1060, i5, 16 GB RAM) runs at about 2000 samples per second.

You can try it yourself with his updated code: https://github.com/fatchord/WaveRNN.git

The only things I changed were the mel parameters (n_fft=1024, hop_size=256, win_length=1024) in dsp.py and the model's upsample_factors to (4, 4, 16) to match my hop size.
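
If anyone else adapts this, the constraint to keep in mind is that the product of upsample_factors must equal the hop size, so each upsampled mel frame lines up with exactly hop_size audio samples. A quick sanity check:

```python
import math

hop_size = 256
upsample_factors = (4, 4, 16)

# the upsample network stretches each mel frame by the product of the factors,
# so this product has to equal the STFT hop size
assert math.prod(upsample_factors) == hop_size
```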

That sounds good. Is that from the training data, or from test data?

These are from the test data; about 50 sentences were held out for testing.

I'm currently finishing up training the text-to-mel part (DCTTS); it should be done in a few hours. I'll then generate some samples, run them through the vocoder, and post the outputs here once done.

Here's audio for a new sentence; the static is still an issue, probably due to the 9-bit audio.
new_sent.tar.gz

How is your model coming along?

Your new sentence sounds good apart from some noise, and it sounds natural. Is that generated by DCTTS and WaveRNN together?

Do you have any ideas about processing text to get better attention alignments?
I'm working on Tacotron 2 and have spent about one month on it, but the results are not good.