Confused about how to specify max mel_frames in the output spectrogram and training audio sample length in hparams.py
jjoe1 opened this issue · 0 comments
First, thanks for this detailed implementation of the original Tacotron model and for the wiki.
I've been trying to read the wiki, this code, and the Tacotron paper (https://arxiv.org/pdf/1703.10135.pdf) for the last several days, but I'm confused about something basic. As someone trying to learn text-to-speech models, I'm unclear about how a fixed-length spectrogram is generated for an input text during training.
-
The max ground-truth clip length in the LJSpeech dataset is about 14 s; with a 12.5 ms frame shift, wouldn't that indirectly fix the maximum mel_frames in the output at 14 * (1/0.0125) = 14 * 80 = 1120 (see the quick calculation below)? Also, what is the max_sentence_length of the input text after padding? I assume all input sentences used during training and inference are padded to a single max_len; is that correct?
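For concreteness, this is the back-of-the-envelope calculation I'm doing (variable names here are my own, not necessarily the ones used in hparams.py):

```python
import math

# Rough numbers for LJSpeech (names are illustrative, not taken from hparams.py).
max_clip_seconds = 14.0        # longest ground-truth clip I found
frame_shift_seconds = 0.0125   # 12.5 ms hop between mel frames

frames_per_second = 1.0 / frame_shift_seconds                     # 80 frames/s
max_mel_frames = math.ceil(max_clip_seconds * frames_per_second)  # 1120 frames

print(frames_per_second, max_mel_frames)
```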
-
A related, possibly beginner, question: after the encoder creates 256 hidden states (from the 256 bidirectional LSTM units), isn't the decoder output limited to 256 frames (for output-layer reduction factor r=1)? If I understand encoder-decoder models correctly, and the decoder produces one frame per encoder state when r=1, how can it produce more frames than there are encoder states? The sketch below shows the mental model I currently have, which is where the apparent limit comes from.
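This is a rough sketch of how I currently picture the decoder loop (plain NumPy, not this repo's code; shapes and names are made up for illustration), and it is why the output seems bounded by the number of encoder states:

```python
import numpy as np

num_encoder_states = 256   # what I understand the encoder to produce
mel_dim = 80               # mel channels per frame
r = 1                      # output-layer reduction factor

# Dummy encoder outputs, one vector per "state" in my mental model.
encoder_states = np.random.randn(num_encoder_states, 256)

frames = []
for state in encoder_states:
    # In my mental model the decoder emits r frames per encoder state,
    # which would cap the output at num_encoder_states * r = 256 frames.
    frames.extend(np.zeros((r, mel_dim)))

mel_spectrogram = np.stack(frames)
print(mel_spectrogram.shape)   # (256, 80) under this assumption
```

If that picture is wrong (e.g. the decoder steps are driven by attention and a stop condition rather than by the encoder length), a pointer to the relevant part of the code or wiki would really help.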