Pay Less Attention with Lightweight and Dynamic Convolutions

Abstract

  • self-attention is strong, but the importance of its long-range, content-based attention is in question
  • propose lightweight convolution and dynamic convolution; the dynamic kernel is a function of the current timestep only, the number of weights is small, and the cost is linear in the input length; both perform better than or on par with self-attention in machine translation, summarization, and language modeling
  • in machine translation, sets a new SoTA of 29.7 BLEU on WMT14 En-De

Details

Background

  • Depth-wise Convolution : performs convolution independently over every channel (see the sketch below)
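
For reference, a minimal PyTorch sketch (my own, not the authors' fairseq code) of a depth-wise convolution: setting `groups` equal to the number of channels gives every channel its own kernel, so the layer needs only d·k weights instead of d²·k.

```python
import torch
import torch.nn as nn

# Depth-wise 1D convolution: groups == channels, so each channel c is
# convolved with its own k-wide kernel, independently of the other channels.
d, k, n = 8, 3, 10                            # channels, kernel width, sequence length
depthwise = nn.Conv1d(d, d, kernel_size=k, padding=k // 2, groups=d)

x = torch.randn(1, d, n)                      # (batch, channels, time)
y = depthwise(x)                              # (1, d, n); d*k weights instead of d*d*k
print(y.shape)                                # torch.Size([1, 8, 10])
```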

Light-weight Convolution

  • weight sharing across channels (H = 16 heads) + softmax-normalized depth-wise convolution
  • DropConnect on the normalized weights is used for regularization (see the sketch below)
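
A minimal sketch of how I read LightConv (hypothetical code, not the fairseq implementation): the raw kernel of shape (H, k) is softmax-normalized over the kernel width k, DropConnect drops entries of the normalized weights during training, and each head's kernel is shared by d/H consecutive channels of a depth-wise convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightConv1d(nn.Module):
    """Sketch: softmax-normalized depth-wise convolution with H shared heads."""
    def __init__(self, d, kernel_size=3, num_heads=16, weight_dropout=0.1):
        super().__init__()
        assert d % num_heads == 0
        self.d, self.k, self.H = d, kernel_size, num_heads
        self.weight = nn.Parameter(torch.randn(num_heads, 1, kernel_size))
        self.weight_dropout = weight_dropout

    def forward(self, x):                        # x: (batch, d, n)
        w = F.softmax(self.weight, dim=-1)       # normalize over the kernel width
        # DropConnect: drop entries of the normalized weights while training
        w = F.dropout(w, self.weight_dropout, training=self.training)
        # weight sharing: each block of d/H consecutive channels reuses one head
        w = w.repeat_interleave(self.d // self.H, dim=0)   # (d, 1, k)
        return F.conv1d(x, w, padding=self.k // 2, groups=self.d)

x = torch.randn(2, 64, 10)                       # (batch, channels, time)
print(LightweightConv1d(64)(x).shape)            # torch.Size([2, 64, 10])
```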

Dynamic Convolution

  • the kernel is a function of the current timestep, predicted from the input at that position, and applied as a light-weight convolution (see the sketch below)
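
A minimal sketch of how I understand DynamicConv (hypothetical; the paper uses a more efficient band-matrix/CUDA implementation): a linear layer predicts H kernels of width k from the input at each position, the kernels are softmax-normalized, and each one is applied to the k-wide window around that position, with d/H channels sharing a head as in LightConv.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv1d(nn.Module):
    """Sketch: the convolution kernel is predicted from the current timestep's input."""
    def __init__(self, d, kernel_size=3, num_heads=16):
        super().__init__()
        assert d % num_heads == 0
        self.d, self.k, self.H = d, kernel_size, num_heads
        # f(x_i): linear map from the input at position i to H kernels of width k
        self.weight_proj = nn.Linear(d, num_heads * kernel_size)

    def forward(self, x):                                   # x: (batch, n, d)
        B, n, d = x.shape
        H, k = self.H, self.k
        # per-position kernels, softmax-normalized over the kernel width
        w = F.softmax(self.weight_proj(x).view(B, n, H, k), dim=-1)
        # gather the k-wide window around every position (centered padding here;
        # a decoder would pad on the left only to stay causal)
        x_pad = F.pad(x, (0, 0, k // 2, k - 1 - k // 2))
        win = x_pad.unfold(1, k, 1).reshape(B, n, H, d // H, k)
        # each head's kernel is shared by d/H channels, as in LightConv
        out = (w.unsqueeze(3) * win).sum(-1)                # (B, n, H, d//H)
        return out.reshape(B, n, d)

x = torch.randn(2, 10, 64)                                  # (batch, time, channels)
print(DynamicConv1d(64)(x).shape)                           # torch.Size([2, 10, 64])
```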

Overall Structure

(figure: overall module structure replacing self-attention: input projection + GLU, LightConv/DynamicConv, output projection)
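
And a hedged sketch of the block as I read that figure (not the fairseq code; a plain depth-wise Conv1d stands in for LightConv/DynamicConv): an input projection expands d to 2d, a GLU gates it back down to d, the convolution mixes information across time, and an output projection maps back to d. This module takes the place of self-attention inside the Transformer block.

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Sketch of the sublayer that replaces self-attention:
    Linear (d -> 2d) + GLU -> convolution over time -> Linear (d -> d)."""
    def __init__(self, d, kernel_size=3):
        super().__init__()
        self.in_proj = nn.Linear(d, 2 * d)      # expand to 2d ...
        self.glu = nn.GLU(dim=-1)               # ... GLU gates it back down to d
        # stand-in for LightConv/DynamicConv: a plain depth-wise convolution
        self.conv = nn.Conv1d(d, d, kernel_size, padding=kernel_size // 2, groups=d)
        self.out_proj = nn.Linear(d, d)

    def forward(self, x):                       # x: (batch, n, d)
        h = self.glu(self.in_proj(x))           # (batch, n, d)
        h = self.conv(h.transpose(1, 2)).transpose(1, 2)   # convolve along time
        return self.out_proj(h)

x = torch.randn(2, 10, 64)
print(ConvModule(64)(x).shape)                  # torch.Size([2, 10, 64])
```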

Results

  • DynamicConv achieves 29.7 BLEU on WMT14 En-De with the same parameter count as Transformer Big
  • Ablation
    • DynamicConv is about 20% faster than the self-attention baseline

Personal Thoughts

  • impressive result, improving both performance and speed over the Transformer
  • I wonder what the timestep-dependent kernels are actually capturing
  • will the performance hold with a small number of layers? CNNs seem to gather contextual information by stacking layers, whereas self-attention can obtain global context in a single operation

Link : https://openreview.net/pdf?id=SkVhlh09tX
Authors : Wu et al. 2018

As I understand it, the GLU expands the dimension to n×2d, and then the convolution rescales it back to n×d, right?
I still don't understand how to apply the softmax in LightweightConv. Do we apply the softmax to all the kernel weights of the conv layer?

Moreover, I am not clear about the authors' weight sharing, since I am trying to re-implement this architecture.

Could you give me some explanation?

Thank you so much.