
On the relationship between self-attention and convolutional layers


Paper

Link: https://arxiv.org/pdf/1911.03584.pdf
Year: 2020

Summary

  • self-attention layers can perform convolution; in practice, they learn to behave similarly to convolutional layers
  • a multi-head self-attention layer with a sufficient number of heads is at least as expressive as any convolutional layer

[Image: Theorem 1 of the paper, stating that a multi-head self-attention layer with N_h heads (and a relative positional encoding) can express any convolutional layer of kernel size √N_h × √N_h.]

This theorem holds for both 1D and 2D convolutional layers.
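
As a sanity check of the constructive idea, here is a minimal sketch (not the authors' code; the shapes, the `torch.roll`-based shifting, and the interior-pixel comparison are assumptions made for illustration): with one head per kernel position, each head hard-attends to the pixel at a fixed offset from the query, and the per-head outputs apply the matching slice of the K × K filter, reproducing the convolution.

```python
import torch

torch.manual_seed(0)
H = W = 6           # image height/width
D_in, D_out = 4, 8  # input/output channels
K = 3               # kernel size -> N_h = K * K heads

x = torch.randn(1, D_in, H, W)

# Reference convolution (zero padding keeps the spatial size).
conv = torch.nn.Conv2d(D_in, D_out, K, padding=K // 2, bias=False)
y_conv = conv(x)

# Attention construction: each "head" hard-attends only to the pixel at
# offset (dy, dx) from the query; gathering that shifted feature map and
# applying the matching filter slice plays the role of the value/output
# projections in the paper's proof.
offsets = [(dy, dx) for dy in range(-(K // 2), K // 2 + 1)
                    for dx in range(-(K // 2), K // 2 + 1)]
y_attn = torch.zeros_like(y_conv)
for dy, dx in offsets:
    shifted = torch.roll(x, shifts=(-dy, -dx), dims=(2, 3))  # wraps at borders
    w_h = conv.weight[:, :, dy + K // 2, dx + K // 2]        # (D_out, D_in) slice
    y_attn += torch.einsum('oc,bchw->bohw', w_h, shifted)

# Interior pixels match exactly; borders differ only because roll wraps
# around where the convolution zero-pads.
print(torch.allclose(y_attn[..., 1:-1, 1:-1], y_conv[..., 1:-1, 1:-1], atol=1e-5))
```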

  • the similarity between convolution and multi-head self-attention is striking when the query pixel is slid over the image: each head keeps attending to a roughly fixed offset from the query, much like one position of a convolutional kernel (see the sketch below)
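
In the paper's construction, each head is steered toward one shift by a quadratic relative positional score, softmaxed over positions; as the sharpness parameter grows, the attention approaches a one-hot on the target offset. A tiny sketch of that effect (the window size, `alpha` values, and target shift `delta_head` are illustrative choices, not values from the paper):

```python
import torch

# Relative positions delta in a 5x5 window around the query pixel.
coords = torch.stack(torch.meshgrid(
    torch.arange(-2, 3), torch.arange(-2, 3), indexing='ij'),
    dim=-1).reshape(-1, 2).float()

delta_head = torch.tensor([1.0, -1.0])  # the shift this head should focus on
for alpha in (0.5, 2.0, 10.0):
    scores = -alpha * ((coords - delta_head) ** 2).sum(-1)  # quadratic positional score
    probs = torch.softmax(scores, dim=0)
    print(f"alpha={alpha:>4}: max prob {probs.max():.3f} "
          f"at offset {coords[probs.argmax()].tolist()}")
```

As `alpha` increases, the printed maximum probability approaches 1.0 at the head's target offset, which is how each head effectively selects one kernel position.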

Conclusion

We showed that self-attention layers applied to images can express any convolutional layer (given sufficiently many heads) and that fully-attentional models learn to combine local behavior (similar to convolution) and global attention based on input content. More generally, fully-attentional models seem to learn a generalization of CNNs where the kernel pattern is learned at the same time as the filters—similar to deformable convolutions (Dai et al., 2017; Zampieri, 2019). Interesting directions for future work include translating existing insights from the rich CNNs literature back to transformers on various data modalities, including images, text and time series.