Self-Attention with Relative Position Representations
kweonwooj commented
Abstract
- proposes relative position representations in the self-attention mechanism to efficiently incorporate relative position information
- WMT14 En-De +1.3 BLEU, En-Fr +0.3 BLEU
Details
Introduction
- Position Representation in Sequence to Sequence task
- RNNs capture relative and absolute positions along the time dimension directly through their sequential structure
- CNNs capture relative positions within the kernel size of each convolution, but have still been shown to benefit from (sinusoidal) position encodings
- the self-attention mechanism is invariant to sequence ordering, hence requires an explicit position representation
- the original Transformer paper used sinusoidal position encodings
- this paper contributes relative position representations for the self-attention mechanism
Relation-aware Self-Attention
- Self-Attention
- input sequence x = (x_1, ..., x_n) is mapped to a new sequence z = (z_1, ..., z_n)
- scaled dot products between all pairs of positions give the compatibility scores, eq. (2): e_ij = (x_i W^Q)(x_j W^K)^T / sqrt(d_z)
- attention weights alpha_ij are computed from e_ij with a softmax
- each output z_i is the weighted sum of linearly transformed inputs, eq. (1): z_i = sum_j alpha_ij (x_j W^V)
- Relation-aware Self-Attention
- edges between inputs x_i and x_j are represented by learned vectors a_ij^K and a_ij^V, which are added to the keys and values: e_ij = (x_i W^Q)(x_j W^K + a_ij^K)^T / sqrt(d_z) (eq. 4) and z_i = sum_j alpha_ij (x_j W^V + a_ij^V) (eq. 3); see the sketch below
Relative Position Representation
- the edge vectors depend only on the relative distance j - i, clipped to a maximum of k: a_ij^K = w^K_clip(j-i, k), a_ij^V = w^V_clip(j-i, k) with clip(x, k) = max(-k, min(k, x)), so only 2k + 1 relative representations are learned
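A minimal NumPy sketch of single-head relation-aware self-attention with clipped relative position embeddings. The function names, shape conventions, and the wK/wV lookup tables are my own illustration of eqs. (3)-(4), not the authors' tensor2tensor implementation:

```python
import numpy as np

def clip(x, k):
    # clip(x, k) = max(-k, min(k, x)) as in the paper
    return max(-k, min(k, x))

def relative_self_attention(x, Wq, Wk, Wv, wK, wV, k):
    """Single-head relation-aware self-attention, eqs. (3)-(4).

    x:            (n, d_x) input sequence
    Wq, Wk, Wv:   (d_x, d_z) projection matrices
    wK, wV:       (2k + 1, d_z) relative position embeddings
    """
    n = x.shape[0]
    q, key, val = x @ Wq, x @ Wk, x @ Wv                  # each (n, d_z)
    d_z = q.shape[-1]

    # a_ij^K and a_ij^V, indexed by the clipped relative distance j - i
    idx = np.array([[clip(j - i, k) + k for j in range(n)] for i in range(n)])
    aK, aV = wK[idx], wV[idx]                             # each (n, n, d_z)

    # eq. (4): e_ij = (x_i W^Q)(x_j W^K + a_ij^K)^T / sqrt(d_z)
    e = np.einsum("id,ijd->ij", q, key[None, :, :] + aK) / np.sqrt(d_z)

    # softmax over j gives the attention weights alpha_ij
    alpha = np.exp(e - e.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)

    # eq. (3): z_i = sum_j alpha_ij (x_j W^V + a_ij^V)
    return np.einsum("ij,ijd->id", alpha, val[None, :, :] + aV)

# tiny usage example with random weights
n, d_x, d_z, k = 5, 8, 8, 2
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d_x))
Wq, Wk, Wv = (rng.normal(size=(d_x, d_z)) for _ in range(3))
wK, wV = (rng.normal(size=(2 * k + 1, d_z)) for _ in range(2))
print(relative_self_attention(x, Wq, Wk, Wv, wK, wV, k).shape)  # (5, 8)
```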
Efficient Implementation
- splitting the computation of eq. (4) into two terms (eq. 5) lets the relative position term be computed with efficient batched matrix multiplications; the overall overhead versus the baseline Transformer is only a ~7% decrease in steps per second
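A rough sketch of the two-term split of the attention logits (eq. 5), using my own shape conventions; the paper describes a reshape-and-batched-matmul trick, which the einsum below expresses equivalently:

```python
import numpy as np

def split_attention_logits(q, key, aK):
    """Two-term split of eq. (4), i.e. eq. (5).

    q, key: (b, h, n, d_z) queries/keys for b sequences and h heads
    aK:     (n, n, d_z) relative position embeddings a_ij^K, shared across
            batch and heads (this sharing is what makes the split efficient)
    """
    d_z = q.shape[-1]
    # term 1: standard content-content logits, a single batched matmul
    content = q @ key.transpose(0, 1, 3, 2)            # (b, h, n, n)
    # term 2: content-position logits against the shared aK table
    position = np.einsum("bhid,ijd->bhij", q, aK)      # (b, h, n, n)
    return (content + position) / np.sqrt(d_z)
```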
Experiments
Ablation Experiments
- clipping distance k : BLEU is largely unchanged for k >= 2
- Position of Relative Representation : whether relative representations are added to the keys (a_ij^K in eq. 4), the values (a_ij^V in eq. 3), or both
Personal Thoughts
- was always curious about the role and effectiveness of sinusoidal positional encodings; good to see an improvement in BLEU, but would like to see a qualitative analysis of how the relative position representations are learned
Link : https://arxiv.org/pdf/1803.02155.pdf
Authors : Shaw et al. 2018