janvainer/speedyspeech

the choice of positional encoding


Why did you choose to add the positional encoding to the input of the decoder rather than to the output of the embedding?

Hi, what do you mean by output of embedding? Do you mean the output of the encoder? The positional encoding is supposed to make it easier for the decoder to deal with repeated sequences of phoneme encodings. Based on the duration predicted for a given phoneme, the phoneme encoding is repeated that many times on the decoder input. Since this creates homogeneous segments of identical vectors, the decoder's task is to learn an expansive mapping from a few vectors to many different vectors (spectrogram frames). The positional encoding makes each decoder input unique, so that the mapping is closer to a one-to-one mapping and easier to learn. Adding positional encoding to the encoder output before expanding each phoneme encoding would not help with the problem described above. Does that answer your question?
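
For illustration, here is a minimal PyTorch sketch of that idea (the function names and shapes are hypothetical, not the repo's exact API): the phoneme encodings are expanded by the predicted durations, which creates runs of identical vectors, and a sinusoidal positional encoding is then added to the expanded sequence that feeds the decoder.

```python
import torch

def sinusoidal_positional_encoding(length, channels):
    """Standard sinusoidal PE (hypothetical helper; channels assumed even)."""
    position = torch.arange(length, dtype=torch.float).unsqueeze(1)            # (length, 1)
    div_term = torch.exp(
        torch.arange(0, channels, 2, dtype=torch.float)
        * (-torch.log(torch.tensor(10000.0)) / channels)
    )                                                                           # (channels/2,)
    pe = torch.zeros(length, channels)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe                                                                   # (length, channels)

def expand_and_add_pe(phoneme_encodings, durations):
    """Repeat each phoneme encoding by its predicted duration, then add PE.

    phoneme_encodings: (num_phonemes, channels) -- encoder outputs
    durations:         (num_phonemes,)          -- integer frame counts
    """
    # Duration-based expansion: creates homogeneous runs of identical vectors.
    expanded = torch.repeat_interleave(phoneme_encodings, durations, dim=0)     # (num_frames, channels)
    # The PE makes each frame within a run unique, so the decoder's mapping
    # to spectrogram frames is closer to one-to-one.
    pe = sinusoidal_positional_encoding(expanded.size(0), expanded.size(1))
    return expanded + pe
```

Without the added PE, every frame inside a run would be identical and the decoder would have to produce different spectrogram frames from identical inputs.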

[image: model diagram showing the output of nn.Embedding]
The output of the embedding I referred to is shown in the picture. Based on my understanding of your answer, the decoder needs inputs that carry position information. But doesn't the encoder need that information as well? I notice that PE is usually added to the output of nn.Embedding, as in the Transformer and related models.

In Transformer-based models, PE is usually added to the encoder input because the attention layers have no notion of position by themselves -- they operate on a global scale. In SpeedySpeech, the encoder is fully convolutional. Convolutions operate on a local scale, on neighboring elements of the sequence, so position is implicitly encoded in the layers themselves. You can try adding PE to the encoder as well, but it works without it.
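
A quick way to see the difference (a toy check, not code from the repo): self-attention without PE is permutation-equivariant, so it cannot tell positions apart, while a 1-D convolution mixes neighboring frames and is therefore order-sensitive.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 8, 16)                      # (batch, time, channels)
perm = torch.randperm(8)

# Self-attention (no PE) is permutation-equivariant: shuffling the inputs
# just shuffles the outputs -- it has no built-in notion of position.
attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
y, _ = attn(x, x, x)
y_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])
print(torch.allclose(y[:, perm], y_perm, atol=1e-5))   # True

# A 1-D convolution is not: it reads neighboring frames, so order matters.
conv = nn.Conv1d(16, 16, kernel_size=3, padding=1)
z = conv(x.transpose(1, 2)).transpose(1, 2)
z_perm = conv(x[:, perm].transpose(1, 2)).transpose(1, 2)
print(torch.allclose(z[:, perm], z_perm, atol=1e-5))    # False
```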

Got it! Thank you. Besides, I found that SpeedySpeech does not handle long sentences well. When a long sentence is given as input, the synthesized audio is not stable and some segments are completely abnormal. My preliminary guess is that the receptive field is not large enough, especially in the encoder (where the kernel size is 4 and the maximum dilation is 4). Do you have any other ideas about synthesis stability for long sentences? A rough receptive-field calculation is sketched below.
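
For reference, the receptive field of a stack of stride-1 dilated convolutions grows by (kernel_size - 1) * dilation per layer. The layer list below is a hypothetical configuration based on those numbers, not the exact encoder:

```python
def receptive_field(layers):
    """Receptive field (in sequence steps) of a stack of stride-1 dilated 1-D convolutions.

    layers: list of (kernel_size, dilation) tuples.
    Each layer adds (kernel_size - 1) * dilation steps.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf

# Hypothetical encoder stack with kernel size 4 and dilations 1, 2, 4.
print(receptive_field([(4, 1), (4, 2), (4, 4)]))  # -> 22
```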

Could you post some examples that did not work for you? I tested the model on 10-second audios and had no problems with it, but it may be that much longer audio is problematic -- I haven't tested e.g. full paragraphs.
Also, are you training the model from scratch, or are you using our pretrained checkpoint? The model can yield unstable results if the teacher model for alignment is not trained properly, as discussed in this issue.

The long sentences I mentioned before do refer to samples longer than 10 seconds.