janvainer/speedyspeech

clarification request - normalization used in teacher model vs student

Closed this issue · 4 comments

@janvainer2 :

The normalization used in the teacher model is MinMaxNorm, while the student model uses StandardNorm. Correspondingly, the final layers of the teacher and student models are a sigmoid and a conv1d with identity activation, respectively.

Is there any reason why different normalization choices were made for the teacher vs. the student?

Hi, good question! The main reason for using sigmoid in the teacher was to minimize error accumulation during inference. The teacher model uses its own output as its next input. If an output is slightly different from what the ground truth would be, the next output will be even more different from what should actually be generated. The error keeps growing with each generated frame, and inference can completely diverge for long sequences. I also experimented with tanh and linear activations, but sigmoid produced the lowest absolute error, and error accumulation was less of a problem than with tanh or linear. To be able to use sigmoid, you have to apply min-max normalization so that all the data lie in [0, 1].
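
For reference, a minimal sketch of what this pairing looks like (the class and attribute names here are illustrative, not the repo's actual API): min/max statistics are precomputed over the training mels, and the teacher ends in a sigmoid so its predictions stay inside the normalized range.

```python
import torch

class MinMaxNorm:
    """Scale mel spectrograms to [0, 1] using dataset-wide min/max.

    `mel_min` and `mel_max` are assumed to be precomputed over the
    training set (hypothetical names, not SpeedySpeech's actual classes).
    """
    def __init__(self, mel_min: float, mel_max: float, eps: float = 1e-8):
        self.mel_min, self.mel_max, self.eps = mel_min, mel_max, eps

    def normalize(self, mel: torch.Tensor) -> torch.Tensor:
        return (mel - self.mel_min) / (self.mel_max - self.mel_min + self.eps)

    def denormalize(self, mel: torch.Tensor) -> torch.Tensor:
        return mel * (self.mel_max - self.mel_min + self.eps) + self.mel_min

# The teacher's last layer then squashes each predicted frame back into
# [0, 1], so a slightly-off frame fed back as the next input can never
# drift outside the range the network was trained on:
# out = torch.sigmoid(final_conv(hidden))
```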

The student is non-autoregressive, so its outputs are never fed back in as inputs; there is no error accumulation, and hence no need for min-max normalization.

Thanks. Then I'm thinking we could use min-max normalization for the student model as well. It would give a mel spectrogram normalized to [0, 1], which could be used later.

Do you see any catches?

I would not use sigmoid if it is not necessary. It has worse gradient-flow properties than a simple linear layer or ReLU and is also more expensive to compute. That being said, if the features are standardized to zero mean and unit variance, the student network fits approximately Gaussian data with a Gaussian model, which is nice and should be easier for the network (smaller risk of a model/data mismatch).
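
A corresponding sketch for the student side (again with illustrative names and assumed dimensions, e.g. a 256-channel hidden state and 80 mel bins): features are standardized with per-bin training-set statistics, and the output head is a plain 1x1 conv1d with identity activation, so it can regress values of any sign and magnitude.

```python
import torch

class StandardNorm:
    """Standardize mels to zero mean / unit variance with training-set stats.

    `mean` and `std` are assumed to be precomputed per mel bin over the
    training data (hypothetical names, not the repo's actual API).
    """
    def __init__(self, mean: torch.Tensor, std: torch.Tensor, eps: float = 1e-8):
        self.mean, self.std, self.eps = mean, std, eps

    def normalize(self, mel: torch.Tensor) -> torch.Tensor:
        return (mel - self.mean) / (self.std + self.eps)

    def denormalize(self, mel: torch.Tensor) -> torch.Tensor:
        return mel * (self.std + self.eps) + self.mean

# Student output head: a 1x1 conv with no activation (identity), so the
# network regresses unbounded, roughly Gaussian targets directly.
head = torch.nn.Conv1d(in_channels=256, out_channels=80, kernel_size=1)
```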

Thanks for the clarification.