
Self-Attention with Relative Position Representations


Abstract

  • propose relative position representations in the self-attention mechanism to efficiently incorporate relative position information
  • WMT14 EnDe +1.3 BLEU, EnFr +0.3 BLEU

Details

Introduction

  • Position Representation in Sequence to Sequence task
    • RNNs capture relative and absolute positions along the time dimension directly through their sequential structure
    • CNNs capture relative positions within the kernel size of each convolution, but have still been shown to benefit from (sinusoidal) position encodings
    • Self-Attention mechanism is invariant to sequence ordering, hence requires position representation
      • original Transformer paper used sinusoidal based position encoding
      • this paper contributes by proposing relative position representation for self-attention mechanism

Relation-aware Self-Attention

  • Self-Attention
    • input sequence x = (x_1, ..., x_n) is mapped to a new sequence z = (z_1, ..., z_n); the scaled dot-product attention logit between every pair of positions is calculated as in eq. (2)
      e_ij = (x_i W^Q)(x_j W^K)^T / sqrt(d_z)    (eq. 2)
    • attention weights (alpha) are computed by applying a softmax over the logits e_ij
      alpha_ij = exp(e_ij) / sum_k exp(e_ik)
    • final token representation z_i is computed as the weighted sum of the linearly transformed inputs, as in eq. (1)
      z_i = sum_j alpha_ij (x_j W^V)    (eq. 1)
  • Relation-aware Self-Attention
    • inputs and outputs are the same as in self-attention, but pairwise edge information is injected into the attention computation; eq. (2) becomes eq. (4), where a relative position representation a_ij^K is added to the key (see the sketch after this list)
      e_ij = (x_i W^Q)(x_j W^K + a_ij^K)^T / sqrt(d_z)    (eq. 4)
    • final token representation is computed with edge information a_ij^V added to the value; eq. (1) becomes eq. (3)
      z_i = sum_j alpha_ij (x_j W^V + a_ij^V)    (eq. 3)
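
Below is a minimal NumPy sketch of eq. (2)-(4) for a single attention head, assuming the edge tensors a_k and a_v (the a_ij^K and a_ij^V above) are already built; all variable names are mine, not from the paper's released code.

```python
import numpy as np

def relation_aware_attention(x, w_q, w_k, w_v, a_k, a_v):
    """x: (n, d_x); w_q/w_k/w_v: (d_x, d_z); a_k, a_v: (n, n, d_z)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                 # (n, d_z) each
    d_z = q.shape[-1]

    # eq. (4): e_ij = x_i W^Q (x_j W^K + a_ij^K)^T / sqrt(d_z)
    e = (q @ k.T + np.einsum('id,ijd->ij', q, a_k)) / np.sqrt(d_z)

    # softmax over j gives the attention weights alpha_ij
    alpha = np.exp(e - e.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)

    # eq. (3): z_i = sum_j alpha_ij (x_j W^V + a_ij^V)
    return alpha @ v + np.einsum('ij,ijd->id', alpha, a_v)
```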

Relative Position Representation

  • relative positions are clipped to a maximum absolute distance of k, so positions farther apart than k share the same learned representation (see the helper sketched below)
    a_ij^K = w^K_clip(j-i, k), a_ij^V = w^V_clip(j-i, k), where clip(x, k) = max(-k, min(k, x))
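
A small helper, under the same (assumed) naming as the sketch above, showing how clipped relative distances index two learned embedding tables of shape (2k+1, d_z) to produce a_k and a_v:

```python
import numpy as np

def relative_position_representations(n, k, w_k_emb, w_v_emb):
    """w_k_emb, w_v_emb: (2k + 1, d_z) learned tables; returns two (n, n, d_z) tensors."""
    pos = np.arange(n)
    # clip(j - i, -k, k), then shift into [0, 2k] to use as an embedding index
    rel = np.clip(pos[None, :] - pos[:, None], -k, k) + k
    return w_k_emb[rel], w_v_emb[rel]
```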

Efficient Implementation

  • splitting eq. (4) into the two terms below enables an efficient batched implementation that shares the relative position representations across sequences and heads; the relative-position model trains only ~7% slower (steps per second) than the baseline (sketched in code below)
    e_ij = ( x_i W^Q (x_j W^K)^T + x_i W^Q (a_ij^K)^T ) / sqrt(d_z)
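
A sketch of this two-term split over a batch of heads; the (batch*heads, n, d_z) layout and names are my assumptions, but the transpose trick that lets a single (n, n, d_z) tensor a_k be shared across the batch follows the paper's description:

```python
import numpy as np

def relative_logits(q, k, a_k):
    """q, k: (bh, n, d_z); a_k: (n, n, d_z) shared across the batch dimension."""
    d_z = q.shape[-1]
    term1 = q @ k.transpose(0, 2, 1)        # (bh, n, n): x_i W^Q (x_j W^K)^T

    # second term: x_i W^Q (a_ij^K)^T, computed without tiling a_k per sequence
    q_t = q.transpose(1, 0, 2)              # (n, bh, d_z): query position leads
    term2 = q_t @ a_k.transpose(0, 2, 1)    # (n, bh, n)
    term2 = term2.transpose(1, 0, 2)        # back to (bh, n, n)

    return (term1 + term2) / np.sqrt(d_z)
```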

Experiments

  • WMT14 EnDe +1.3 BLEU
  • WMT14 EnFr +0.3 BLEU

Ablation Experiments

  • clipping distance k
    • with k >= 2, no significant BLEU improvement from larger clipping distances
  • Position of Relative Representation
    • adding a^K alone appears to suffice for representing relative positions; the benefit of also adding a^V needs further investigation

Personal Thoughts

  • was always curious about the role and effectiveness of sinusoidal positional encoding; good to see a BLEU improvement, but I wish to see a qualitative analysis of how the relative position representations are learnt

Link : https://arxiv.org/pdf/1803.02155.pdf
Authors : Shaw et al. 2018