Pay Less Attention with Lightweight and Dynamic Convolutions

Abstract

  • self-attention is strong, but the importance of its long-range, content-based attention is in question
  • propose lightweight convolution and dynamic convolution; the dynamic kernel is a function of the current timestep only, the number of weights is small, and the cost is linear in the input length; both perform better than or on par with self-attention in machine translation, summarization, and language modeling
  • in machine translation, sets a new SoTA of 29.7 BLEU on WMT14 En-De

Details

Background

  • Depth-wise Convolution : performs convolution independently over every channel (see the sketch below)
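
For reference, a minimal PyTorch sketch (my own, not the authors' fairseq code) of a depth-wise convolution: setting `groups` equal to the number of channels gives every channel its own kernel, so the layer needs only d·k weights instead of d²·k.

```python
import torch
import torch.nn as nn

# Depth-wise 1D convolution: groups == channels, so each channel c is
# convolved with its own k-wide kernel, independently of the other channels.
d, k, n = 8, 3, 10                            # channels, kernel width, sequence length
depthwise = nn.Conv1d(d, d, kernel_size=k, padding=k // 2, groups=d)

x = torch.randn(1, d, n)                      # (batch, channels, time)
y = depthwise(x)                              # (1, d, n); d*k weights instead of d*d*k
print(y.shape)                                # torch.Size([1, 8, 10])
```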

Light-weight Convolution

  • weight sharing across channels (H = 16 heads) + softmax-normalized depth-wise convolution
  • DropConnect on the normalized weights is used for regularization (see the sketch below)
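
A minimal sketch of how I read LightConv (hypothetical code, not the fairseq implementation): the raw kernel of shape (H, k) is softmax-normalized over the kernel width k, DropConnect drops entries of the normalized weights during training, and each head's kernel is shared by d/H consecutive channels of a depth-wise convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightConv1d(nn.Module):
    """Sketch: softmax-normalized depth-wise convolution with H shared heads."""
    def __init__(self, d, kernel_size=3, num_heads=16, weight_dropout=0.1):
        super().__init__()
        assert d % num_heads == 0
        self.d, self.k, self.H = d, kernel_size, num_heads
        self.weight = nn.Parameter(torch.randn(num_heads, 1, kernel_size))
        self.weight_dropout = weight_dropout

    def forward(self, x):                        # x: (batch, d, n)
        w = F.softmax(self.weight, dim=-1)       # normalize over the kernel width
        # DropConnect: drop entries of the normalized weights while training
        w = F.dropout(w, self.weight_dropout, training=self.training)
        # weight sharing: each block of d/H consecutive channels reuses one head
        w = w.repeat_interleave(self.d // self.H, dim=0)   # (d, 1, k)
        return F.conv1d(x, w, padding=self.k // 2, groups=self.d)

x = torch.randn(2, 64, 10)                       # (batch, channels, time)
print(LightweightConv1d(64)(x).shape)            # torch.Size([2, 64, 10])
```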

Dynamic Convolution

  • the kernel is a function of the current timestep, predicted from the input at that position, and applied as a light-weight convolution (see the sketch below)
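
A minimal sketch of how I understand DynamicConv (hypothetical; the paper uses a more efficient band-matrix/CUDA implementation): a linear layer predicts H kernels of width k from the input at each position, the kernels are softmax-normalized, and each one is applied to the k-wide window around that position, with d/H channels sharing a head as in LightConv.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv1d(nn.Module):
    """Sketch: the convolution kernel is predicted from the current timestep's input."""
    def __init__(self, d, kernel_size=3, num_heads=16):
        super().__init__()
        assert d % num_heads == 0
        self.d, self.k, self.H = d, kernel_size, num_heads
        # f(x_i): linear map from the input at position i to H kernels of width k
        self.weight_proj = nn.Linear(d, num_heads * kernel_size)

    def forward(self, x):                                   # x: (batch, n, d)
        B, n, d = x.shape
        H, k = self.H, self.k
        # per-position kernels, softmax-normalized over the kernel width
        w = F.softmax(self.weight_proj(x).view(B, n, H, k), dim=-1)
        # gather the k-wide window around every position (centered padding here;
        # a decoder would pad on the left only to stay causal)
        x_pad = F.pad(x, (0, 0, k // 2, k - 1 - k // 2))
        win = x_pad.unfold(1, k, 1).reshape(B, n, H, d // H, k)
        # each head's kernel is shared by d/H channels, as in LightConv
        out = (w.unsqueeze(3) * win).sum(-1)                # (B, n, H, d//H)
        return out.reshape(B, n, d)

x = torch.randn(2, 10, 64)                                  # (batch, time, channels)
print(DynamicConv1d(64)(x).shape)                           # torch.Size([2, 10, 64])
```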

Overall Structure

(figure: overall module structure replacing self-attention: input projection + GLU, LightConv/DynamicConv, output projection)
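
And a hedged sketch of the block as I read that figure (not the fairseq code; a plain depth-wise Conv1d stands in for LightConv/DynamicConv): an input projection expands d to 2d, a GLU gates it back down to d, the convolution mixes information across time, and an output projection maps back to d. This module takes the place of self-attention inside the Transformer block.

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Sketch of the sublayer that replaces self-attention:
    Linear (d -> 2d) + GLU -> convolution over time -> Linear (d -> d)."""
    def __init__(self, d, kernel_size=3):
        super().__init__()
        self.in_proj = nn.Linear(d, 2 * d)      # expand to 2d ...
        self.glu = nn.GLU(dim=-1)               # ... GLU gates it back down to d
        # stand-in for LightConv/DynamicConv: a plain depth-wise convolution
        self.conv = nn.Conv1d(d, d, kernel_size, padding=kernel_size // 2, groups=d)
        self.out_proj = nn.Linear(d, d)

    def forward(self, x):                       # x: (batch, n, d)
        h = self.glu(self.in_proj(x))           # (batch, n, d)
        h = self.conv(h.transpose(1, 2)).transpose(1, 2)   # convolve along time
        return self.out_proj(h)

x = torch.randn(2, 10, 64)
print(ConvModule(64)(x).shape)                  # torch.Size([2, 10, 64])
```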

Results

  • DynamicConv achieves 29.7 BLEU on WMT14 En-De with the same parameter count as Transformer Big
  • Ablation
    • DynamicConv is about 20% faster than the self-attention baseline

Personal Thoughts

  • impressive result, improving both performance and speed over the Transformer
  • I wonder what the timestep-dependent kernels are actually capturing
  • will the performance hold with a small number of layers? CNNs seem to gather contextual information by stacking layers, whereas self-attention can obtain global context in a single operation

Link : https://openreview.net/pdf?id=SkVhlh09tX
Authors : Wu et al. 2018

As I understand it, the GLU expands the dimension to n×2d, and then the convolution rescales it back to n×d, right?
I still don't understand how to apply the softmax in LightweightConv. Do we apply the softmax to all the kernel weights of the conv layer?

Moreover, I am not clear about the authors' weight sharing, since I am trying to re-implement this architecture.

Could you give me some explanation?

Thank you so much.