keithito/tacotron

Loss exploded at step 115,000, and the output waves from the latest checkpoint are unintelligible

yilmazay74 opened this issue · 0 comments

Hi,
There is a very similar issue (Issue 226) to the problem I am facing.
I commented on it, but since it was already closed, no one answered my questions, so I ended up creating a new issue.
My issue is this: I prepared a small training set of only 180 audio files. The sample rate is 16 kHz and the bit depth is 16 bit. The longest audio is 20 seconds; the others are between 10 and 20 seconds. I made a few adjustments in the hparams.py file to get rid of an input data shape mismatch error; all of those parameters are listed below. The language of my training set is Turkish.

The training took about 45 hours, almost 2 days.
The loss exploded at step 115k. The average time per step was around 2 seconds.
I evaluated the latest checkpoint. Unfortunately, all the output waves are unintelligible, and they are very short, although they should normally be between 10 and 20 seconds.
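
For reference, here is a rough sketch of the arithmetic behind these figures. It assumes every step consumes a full batch of 32 clips and that the 45 hours were spent entirely on training steps; the variable names are mine and do not come from the repository.

```python
# Back-of-the-envelope numbers for the run described above.
# Assumptions: every step uses a full batch, and the 45 hours are pure step time.
steps = 115_000          # step at which the loss exploded
total_hours = 45         # reported wall-clock training time
batch_size = 32          # from hparams
num_clips = 180          # size of the training set

# Average wall-clock time per step implied by the totals.
seconds_per_step = total_hours * 3600 / steps

# Number of steps needed to see every clip once.
steps_per_pass = num_clips / batch_size

# How many times the whole training set was seen before the explosion.
passes_over_data = steps * batch_size / num_clips

print(f"seconds per step (from totals): {seconds_per_step:.2f}")
print(f"steps per pass over the data:   {steps_per_pass:.3f}")
print(f"passes over the data:           {passes_over_data:.0f}")
```

By this estimate the model saw each of the 180 clips roughly twenty thousand times before the loss exploded, which may be relevant to the question of what went wrong.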

My changes are:
HParam name       Original value     Changed value
cleaners          english_cleaners   transliteration_cleaners
sample_rate       20000              16000
frame_length_ms   50                 100
frame_shift_ms    12.5               25
max_iters         200                400

Below are all the params:
cleaners='transliteration_cleaners',

Audio:
num_mels=80,
num_freq=1025,
sample_rate=16000,
frame_length_ms=100,
frame_shift_ms=25,
preemphasis=0.97,
min_level_db=-100,
ref_level_db=20,

Model:
outputs_per_step=5,
embed_depth=256,
prenet_depths=[256, 128],
encoder_depth=256,
postnet_depth=256,
attention_depth=256,
decoder_depth=256,
epochs=100,

Training:
batch_size=32,
adam_beta1=0.9,
adam_beta2=0.999,
initial_learning_rate=0.002,
decay_learning_rate=True,
use_cmudict=False,

Eval:
max_iters=200,
griffin_lim_iters=60,
power=1.5,
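
To make the effect of these values easier to check, here is a minimal sketch of the frame-level quantities they imply. It is not code from the repository: `derived_audio_params` is a hypothetical helper, and it assumes the usual conventions that hop and window sizes are the millisecond values converted to samples, that `num_freq = n_fft / 2 + 1`, and that the decoder emits at most `max_iters * outputs_per_step` frames. The changed settings are evaluated for both `max_iters` values that appear above (200 and 400).

```python
# Hypothetical helper (not part of the repository): derives frame-level
# quantities from the hparams under the assumptions stated in the lead-in.
def derived_audio_params(sample_rate, frame_length_ms, frame_shift_ms,
                         num_freq, outputs_per_step, max_iters):
    # Hop and window sizes in samples, converted from milliseconds.
    hop_length = round(sample_rate * frame_shift_ms / 1000)
    win_length = round(sample_rate * frame_length_ms / 1000)
    # Assumption: num_freq counts the non-negative FFT bins, n_fft / 2 + 1.
    n_fft = (num_freq - 1) * 2
    # Assumption: the decoder produces outputs_per_step frames per iteration.
    max_frames = max_iters * outputs_per_step
    max_seconds = max_frames * frame_shift_ms / 1000
    return {
        "hop_length": hop_length,
        "win_length": win_length,
        "n_fft": n_fft,
        "max_frames": max_frames,
        "max_seconds": max_seconds,
    }


# Original defaults versus the changed values from this issue.
original = derived_audio_params(20000, 50, 12.5, 1025, 5, 200)
changed_200 = derived_audio_params(16000, 100, 25, 1025, 5, 200)
changed_400 = derived_audio_params(16000, 100, 25, 1025, 5, 400)

for name, params in [("original", original),
                     ("changed, max_iters=200", changed_200),
                     ("changed, max_iters=400", changed_400)]:
    print(name, params)
```

Under these assumptions the original settings cap an utterance at 12.5 seconds, which is shorter than my longest clips, while the changed settings allow 25 or 50 seconds depending on `max_iters`.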

I know 180 files is too few for training; however, I expected that the training would at least finish without problems and produce a less accurate model. Prior to this, I trained a model with 40 files; it ended at step 71,000 (without explosion or error),
and it can at least synthesize the texts in the training set.
Can someone shed some light on the possible cause of this error?
Secondly, can someone tell me whether the parameter values look reasonable?
Any ideas on which tunable parameters I should vary to improve the accuracy?
Thirdly, with an NVIDIA® Tesla® K80 card, the training took about 2 days.
Any ideas on whether it is possible to shorten the overall training time?
I am sharing my train.log file for your reference.
I would appreciate any help or recommendations.
Best regards

train.log