Pay Less Attention with Lightweight and Dynamic Convolutions
Opened this issue · 1 comment
kweonwooj commented
Abstract
- self-attention is strong, but how much of its strength actually comes from long-range dependencies is in question
- propose `lightweight convolution` and `dynamic convolution`: convolutions whose kernels are a function of the timestep, which are lightweight and whose cost is linear in the input length; they perform better than or on par with self-attention in machine translation, summarization and language modeling
- in machine translation, new WMT14 En-De SoTA of 29.7 BLEU
Details
Background
Light-weight Convolution
- weight sharing (`H = 16` heads) + softmax-normalized depth-wise convolution
- DropConnect is used for regularization
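To make the weight sharing and softmax normalization concrete, here is a minimal PyTorch sketch of a lightweight convolution. It is not the fairseq implementation; the class name `LightweightConv1d`, the causal left-padding and the initialization are my own assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn

class LightweightConv1d(nn.Module):
    """Minimal sketch of a lightweight convolution.

    Channels are split into `num_heads` groups that each share one depth-wise
    kernel (weight sharing), each kernel is softmax-normalized over its width,
    and DropConnect drops entries of the normalized kernel."""
    def __init__(self, d_model, kernel_size=3, num_heads=16, weight_dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0
        self.kernel_size = kernel_size
        self.num_heads = num_heads
        self.weight_dropout = weight_dropout
        # one kernel per head instead of per channel (H << d)
        self.weight = nn.Parameter(torch.randn(num_heads, kernel_size))

    def forward(self, x):
        # x: (batch, time, d_model)
        B, T, C = x.shape
        H, K = self.num_heads, self.kernel_size
        # softmax-normalize each kernel over its width
        w = F.softmax(self.weight, dim=-1)
        # DropConnect: drop kernel weights, not activations
        w = F.dropout(w, p=self.weight_dropout, training=self.training)
        # expand the shared head kernels to all channels for a depth-wise conv
        w = w.repeat_interleave(C // H, dim=0).unsqueeze(1)  # (C, 1, K)
        x = x.transpose(1, 2)                                # (B, C, T)
        # left-pad so the convolution stays causal, as in decoder blocks
        x = F.pad(x, (K - 1, 0))
        out = F.conv1d(x, w, groups=C)                       # depth-wise conv
        return out.transpose(1, 2)                           # (B, T, C)
```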
Dynamic Convolution
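No notes survived here, so only a rough sketch: dynamic convolution predicts the kernel from the current timestep's input with a linear layer and softmax-normalizes it, so the cost stays linear in sequence length. `DynamicConv1d`, `weight_proj` and the unfold-based windowing below are my guesses at a minimal implementation, not the authors' code.

```python
import torch
import torch.nn.functional as F
from torch import nn

class DynamicConv1d(nn.Module):
    """Minimal sketch of a dynamic convolution: the kernel is predicted from the
    current timestep's input, so it changes per position, and is
    softmax-normalized like the lightweight kernel."""
    def __init__(self, d_model, kernel_size=3, num_heads=16):
        super().__init__()
        assert d_model % num_heads == 0
        self.kernel_size = kernel_size
        self.num_heads = num_heads
        # predicts H*K kernel weights from each input vector
        self.weight_proj = nn.Linear(d_model, num_heads * kernel_size)

    def forward(self, x):
        # x: (batch, time, d_model)
        B, T, C = x.shape
        H, K = self.num_heads, self.kernel_size
        R = C // H
        # per-timestep kernels, softmax-normalized over the width K
        w = F.softmax(self.weight_proj(x).view(B, T, H, K), dim=-1)
        # gather a causal window of the K most recent positions per timestep
        x_pad = F.pad(x, (0, 0, K - 1, 0))          # pad the time dimension
        windows = x_pad.unfold(1, K, 1)             # (B, T, C, K)
        windows = windows.reshape(B, T, H, R, K)
        # apply each head's kernel to its slice of channels
        out = torch.einsum('bthrk,bthk->bthr', windows, w)
        return out.reshape(B, T, C)
```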
Overall Structure
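Again only a hedged sketch of how the sub-block that replaces self-attention could be wired (input linear to 2d + GLU back to d, then the convolution over time, then an output linear), which also touches the GLU question in the comment below; `ConvBlock` and this exact wiring are my reading of the paper, not the reference code, and the usage line reuses the lightweight conv sketched above.

```python
import torch
import torch.nn.functional as F
from torch import nn

class ConvBlock(nn.Module):
    """Sketch of the sub-block that replaces self-attention:
    Linear(d -> 2d) + GLU (back to d) -> light/dynamic conv -> Linear(d -> d)."""
    def __init__(self, d_model, conv):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)  # expand to 2d for the GLU
        self.conv = conv                                # e.g. LightweightConv1d above
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x: (batch, time, d_model)
        x = F.glu(self.in_proj(x), dim=-1)  # (B, T, 2d) -> (B, T, d)
        x = self.conv(x)                    # contextualize over time
        return self.out_proj(x)

# usage with the lightweight conv sketched earlier
block = ConvBlock(512, LightweightConv1d(512, kernel_size=7, num_heads=16))
y = block(torch.randn(2, 10, 512))          # -> (2, 10, 512)
```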
Results
Personal Thoughts
- impressive result, improving both performance and speed over the Transformer
- I wonder what the timestep-dependent kernels are capturing
- will performance still be equivalent with a small number of layers? CNNs seem to gather contextual information by stacking, whereas self-attention obtains global context in a single operation
Link : https://openreview.net/pdf?id=SkVhlh09tX
Authors : Wu et al. 2018
demdecuong commented
As I understand it, the GLU expands the dimension to n x 2d, and then the conv rescales it back to n x d, right?
I still don't understand how to apply the softmax in LightweightConv. We softmax all the kernel weights of the conv layer, right?
Moreover, I'm not clear on the authors' weight sharing, as I'm trying to re-implement this architecture.
Please give me some explanation.
Thank you so much.