phlippe/uvadlc_notebooks

[Question] Use of MLP in Transformer Encoder/Decoder

pi-tau opened this issue · 2 comments

pi-tau commented

Tutorial: 6

Hi @phlippe,
I have a question about the feed-forward network used after the multi-head attention layer inside the encoder/decoder blocks. Instead of an MLP, could we use some other architecture, for example a shallow CNN applied over several neighbouring elements of the sequence (see the sketch below)? We would probably lose some speed and the model would no longer be permutation-equivariant, but could this improve model performance?
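
Roughly, this is what I have in mind (the class name, kernel size, and hyperparameters here are just my own placeholders, not taken from the tutorial code):

```python
import torch
import torch.nn as nn


class ConvEncoderBlock(nn.Module):
    """Encoder block where the position-wise MLP is replaced by a shallow
    CNN that mixes each token with its (kernel_size - 1) nearest neighbours."""

    def __init__(self, embed_dim, num_heads, hidden_dim, kernel_size=3, dropout=0.0):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Shallow CNN instead of the usual Linear -> ReLU -> Linear MLP.
        self.conv_net = nn.Sequential(
            nn.Conv1d(embed_dim, hidden_dim, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, embed_dim, kernel_size, padding=kernel_size // 2),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch, seq_len, embed_dim)
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Conv1d expects (batch, channels, seq_len), so transpose around it.
        conv_out = self.conv_net(x.transpose(1, 2)).transpose(1, 2)
        x = self.norm2(x + self.dropout(conv_out))
        return x


block = ConvEncoderBlock(embed_dim=64, num_heads=4, hidden_dim=128)
print(block(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```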

I tried searching for research papers exploring this idea, but could not find anything. I also could not find any solid argument for why it might be a bad idea. Any thoughts, links, and references would be much appreciated.

Thanks!

pi-tau commented

Update:
I found this paper on text-to-speech: in their FFT module (Chapter 3.1) they build the encoder with two convolutional layers instead of the MLP (possibly modelling something like a depthwise-separable convolution, though I'm not sure).

Their argument is that nearby sequence elements are more closely related in the character/phoneme sequences of speech tasks. But with byte-pair encoding in language tasks, wouldn't it make sense to use this design there as well?
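
To make my guess concrete, here is what a depthwise-separable version of such a conv feed-forward could look like (again just my own sketch; I don't know whether the paper actually factorizes its convolutions this way):

```python
import torch
import torch.nn as nn


class DepthwiseSeparableConvFF(nn.Module):
    """Position-wise feed-forward built from a depthwise conv (per-channel
    mixing over neighbouring tokens) followed by pointwise 1x1 convs
    (cross-channel mixing), instead of the usual two Linear layers."""

    def __init__(self, embed_dim, hidden_dim, kernel_size=3):
        super().__init__()
        self.net = nn.Sequential(
            # Depthwise: one filter per channel, mixes along the sequence.
            nn.Conv1d(embed_dim, embed_dim, kernel_size,
                      padding=kernel_size // 2, groups=embed_dim),
            # Pointwise: 1x1 convs, mix across channels (like a Linear layer).
            nn.Conv1d(embed_dim, hidden_dim, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, embed_dim, kernel_size=1),
        )

    def forward(self, x):
        # x: (batch, seq_len, embed_dim) -> Conv1d wants channels in dim 1.
        return self.net(x.transpose(1, 2)).transpose(1, 2)


ff = DepthwiseSeparableConvFF(embed_dim=64, hidden_dim=256)
print(ff(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```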

phlippe commented

Hi @pi-tau, there exist many variations of Transformers, and you can indeed replace the MLP with a different layer such as a convolution. In Computer Vision, combining convolutional methods with Transformers was a hot topic for a while, e.g. https://arxiv.org/abs/2107.06263. In general, though, once you go to scale it is usually best to give the model full flexibility by using only MLPs and letting attention handle the communication between positions in an arbitrary way. For very small datasets, such inductive biases matter more; this is also why RNNs can still beat Transformers when trained from scratch on very small NLP datasets.