Modeling Recurrence for Transformer
Abstract
- propose an additional attentive recurrent network (ARN) on top of the Transformer encoder to leverage the strengths of both attention and recurrent networks
- experiments on WMT14 EnDe and WMT17 ZhEn demonstrate the effectiveness
- ablation study reveals that a short-cut bridge from a shallow ARN outperforms its deep counterpart
Details
Main Approach
- add an additional recurrence encoder on the source side, whose representation is integrated into the decoder alongside the Transformer encoder output
- the recurrence model can be (a) a simple RNN, GRU, or LSTM, or (b) an Attentive Recurrent Network (ARN), where at each recurrent step the context representation is generated via attention with the previous hidden state (a minimal sketch follows below)
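The following is a minimal sketch of the ARN idea as described above: the previous hidden state attends over the source representations to form a context vector, which drives a recurrent cell. Layer choices, dimensions, and the number of recurrent steps here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveRecurrentNetwork(nn.Module):
    """Sketch of an Attentive Recurrent Network (ARN)."""

    def __init__(self, d_model: int, num_steps: int = 4):
        super().__init__()
        self.num_steps = num_steps                 # number of recurrent steps (ablated in the paper)
        self.query_proj = nn.Linear(d_model, d_model)
        self.cell = nn.GRUCell(d_model, d_model)   # could be a vanilla RNN or LSTM cell instead

    def forward(self, src_repr: torch.Tensor) -> torch.Tensor:
        # src_repr: (batch, src_len, d_model) source embeddings/representations
        batch, _, d_model = src_repr.shape
        h = src_repr.new_zeros(batch, d_model)     # initial hidden state
        states = []
        for _ in range(self.num_steps):
            # attention of the previous hidden state over all source positions
            scores = torch.bmm(src_repr, self.query_proj(h).unsqueeze(-1)).squeeze(-1)
            attn = F.softmax(scores / d_model ** 0.5, dim=-1)
            context = torch.bmm(attn.unsqueeze(1), src_repr).squeeze(1)   # (batch, d_model)
            h = self.cell(context, h)              # recurrent update on the attended context
            states.append(h)
        return torch.stack(states, dim=1)          # (batch, num_steps, d_model)
```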
Impact of Components
- ablation study on the size of the additional recurrence encoder
- ablation study on the number of recurrent steps in ARN
- ablation study on how to integrate the representation on the decoder side (a hedged sketch of one integration option follows this list)
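As an illustration of the decoder-side integration question, here is a hedged sketch of one plausible option: the decoder state attends separately over the Transformer encoder output and the ARN output, and the two contexts are fused with a learned gate. The exact mechanism compared in the paper (stacking vs. gated sum vs. the short-cut bridge to the top decoder layer) should be checked against its ablation tables; everything below is an assumption for illustration.

```python
import torch
import torch.nn as nn

class GatedBridge(nn.Module):
    """Hypothetical gated fusion of Transformer-encoder and ARN contexts in a decoder layer."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn_enc = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_arn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, dec_state, enc_out, arn_out):
        # dec_state: (batch, tgt_len, d_model); enc_out / arn_out: (batch, *, d_model)
        c_enc, _ = self.attn_enc(dec_state, enc_out, enc_out)   # context from Transformer encoder
        c_arn, _ = self.attn_arn(dec_state, arn_out, arn_out)   # context from ARN states
        g = torch.sigmoid(self.gate(torch.cat([c_enc, c_arn], dim=-1)))
        return g * c_enc + (1 - g) * c_arn                      # gated sum of the two contexts
```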
Overall Result
Linguistic Analysis
- what linguistic characteristics are the models learning?
- 1-Layer BiARN performs better on all syntactic and some semantic tasks
- List of Linguistic Characteristics
- SeLen : sentence length
- WC : recover the original words of a sentence given its sentence embedding
- TrDep : check whether encoder infers the hierarchical structure of sentences
- ToCo : classify sentences in terms of the sequence of their top constituents
- BShif : tests whether two consecutive tokens are inverted
- Tense : predict tense of the main-clause verb
- SubN : predict the number (singular/plural) of the main-clause subject
- ObjN : predict the number (singular/plural) of the direct object of the main clause
- SoMo : detect whether a sentence has been modified by replacing a random noun or verb
- CoIn : detect whether the order of two coordinate clauses has been inverted (half of the sentences are inverted)
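These probing tasks are typically run by freezing the encoder and training a small classifier on its sentence representations. A minimal sketch under that assumption (pooling strategy, classifier shape, and class counts are illustrative, not the paper's setup):

```python
import torch
import torch.nn as nn

class ProbingClassifier(nn.Module):
    """Small MLP trained on frozen sentence embeddings to predict a probing label
    (e.g. a sentence-length bin for SeLen, or tense for Tense)."""

    def __init__(self, d_model: int, n_classes: int, d_hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, n_classes)
        )

    def forward(self, sent_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(sent_emb)   # logits over the probing-task classes

# usage sketch: mean-pool frozen encoder states into a sentence embedding
# sent_emb = encoder_states.mean(dim=1).detach()          # (batch, d_model)
# logits = ProbingClassifier(d_model=512, n_classes=6)(sent_emb)
```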
Personal Thoughts
- translation requires a complex encoding function on the source side; the strengths of attention, RNN, and CNN can complement each other to produce richer representations
- this paper shows there is a small room for improvement by letting an RNN encoder play a part alongside the Transformer encoder via the short-cut trick
Link : https://arxiv.org/pdf/1904.03092v1.pdf
Authors : Hao et al. 2019