Hi. Thanks for the great repo. I want to apply MoEx to text classification, and I was wondering what type of normalization I should use to compute the mean and std for MoEx on a transformer-based model like RoBERTa.
Thanks for your response. Yes, layer normalization makes sense. With layer norm, I was wondering whether the interpolation formula (injecting the moments of sample B into the normalized features of sample A) can stay the same as the one proposed in your paper.
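For concreteness, here is a minimal sketch of what I have in mind (my own assumption of how the exchange would look with layer-norm statistics, not code from this repo): normalize sample A's hidden states over the feature dimension, then re-scale and re-shift them with sample B's per-position moments. The function name `moex_layernorm` and the shapes are hypothetical.

```python
import numpy as np

def moex_layernorm(h_a, h_b, eps=1e-5):
    """Moment exchange using layer-norm statistics (a sketch, not the
    authors' implementation).

    h_a, h_b: arrays of shape (batch, seq_len, hidden). Moments are taken
    over the last (hidden) dimension, as in layer normalization.
    Returns sample A's normalized features carrying sample B's moments.
    """
    # Per-position mean/std over the hidden dimension for both samples.
    mu_a = h_a.mean(axis=-1, keepdims=True)
    sig_a = h_a.std(axis=-1, keepdims=True) + eps
    mu_b = h_b.mean(axis=-1, keepdims=True)
    sig_b = h_b.std(axis=-1, keepdims=True) + eps
    # Normalize A, then inject B's moments: sig_b * (h_a - mu_a) / sig_a + mu_b.
    return sig_b * (h_a - mu_a) / sig_a + mu_b
```

If this matches the formula in the paper (with layer-norm moments substituted for PONO's), then I assume the label interpolation with the mixing weight λ would also stay unchanged.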