Modeling Recurrence for Transformer
Abstract
- propose an additional attentive recurrent network (ARN) on top of the Transformer encoder to leverage the strengths of both attention and recurrent networks
- experiments on WMT14 EnDe and WMT17 ZhEn demonstrate the effectiveness
- ablation study reveals that a short-cut bridge from a shallow ARN outperforms its deep counterpart
Details
Main Approach
- add an additional recurrence encoder on the source side, whose representation is integrated into the decoder alongside the Transformer encoder output
- the recurrence model can be (a) a simple RNN, GRU, or LSTM, or (b) an Attentive Recurrent Network (ARN), where at each recurrent step the context representation is generated via attention with the previous hidden state (a minimal sketch follows below)
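The following is a minimal sketch of the ARN idea as described above: the previous hidden state attends over the source representations to form a context vector, which drives a recurrent cell. Layer choices, dimensions, and the number of recurrent steps here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveRecurrentNetwork(nn.Module):
    """Sketch of an Attentive Recurrent Network (ARN)."""

    def __init__(self, d_model: int, num_steps: int = 4):
        super().__init__()
        self.num_steps = num_steps                 # number of recurrent steps (ablated in the paper)
        self.query_proj = nn.Linear(d_model, d_model)
        self.cell = nn.GRUCell(d_model, d_model)   # could be a vanilla RNN or LSTM cell instead

    def forward(self, src_repr: torch.Tensor) -> torch.Tensor:
        # src_repr: (batch, src_len, d_model) source embeddings/representations
        batch, _, d_model = src_repr.shape
        h = src_repr.new_zeros(batch, d_model)     # initial hidden state
        states = []
        for _ in range(self.num_steps):
            # attention of the previous hidden state over all source positions
            scores = torch.bmm(src_repr, self.query_proj(h).unsqueeze(-1)).squeeze(-1)
            attn = F.softmax(scores / d_model ** 0.5, dim=-1)
            context = torch.bmm(attn.unsqueeze(1), src_repr).squeeze(1)   # (batch, d_model)
            h = self.cell(context, h)              # recurrent update on the attended context
            states.append(h)
        return torch.stack(states, dim=1)          # (batch, num_steps, d_model)
```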
Impact of Components
- ablation study on the size of the additional recurrence encoder
- ablation study on the number of recurrent steps in ARN
- ablation study on how to integrate the representation on the decoder side (a hedged sketch of one integration option follows this list)
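As an illustration of the decoder-side integration question, here is a hedged sketch of one plausible option: the decoder state attends separately over the Transformer encoder output and the ARN output, and the two contexts are fused with a learned gate. The exact mechanism compared in the paper (stacking vs. gated sum vs. the short-cut bridge to the top decoder layer) should be checked against its ablation tables; everything below is an assumption for illustration.

```python
import torch
import torch.nn as nn

class GatedBridge(nn.Module):
    """Hypothetical gated fusion of Transformer-encoder and ARN contexts in a decoder layer."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn_enc = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_arn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, dec_state, enc_out, arn_out):
        # dec_state: (batch, tgt_len, d_model); enc_out / arn_out: (batch, *, d_model)
        c_enc, _ = self.attn_enc(dec_state, enc_out, enc_out)   # context from Transformer encoder
        c_arn, _ = self.attn_arn(dec_state, arn_out, arn_out)   # context from ARN states
        g = torch.sigmoid(self.gate(torch.cat([c_enc, c_arn], dim=-1)))
        return g * c_enc + (1 - g) * c_arn                      # gated sum of the two contexts
```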
Overall Result
Linguistic Analysis
- what linguistic characteristics are the models learning?
- 1-Layer BiARN performs better on all syntactic and some semantic tasks
- List of Linguistic Characteristics
- SeLen : sentence length
- WC : recover the original words of a sentence given its sentence embedding
- TrDep : check whether encoder infers the hierarchical structure of sentences
- ToCo : classify sentences in terms of the sequence of their top constituents
- BShif : tests whether two consecutive tokens are inverted
- Tense : predict tense of the main-clause verb
- SubN : predict the number (singular/plural) of the main-clause subject
- ObjN : predict the number (singular/plural) of the direct object of the main clause
- SoMo : detect whether a sentence has been modified by replacing a random noun or verb
- CoIn : detect whether the order of two coordinate clauses has been inverted (half of the sentences are inverted)
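These probing tasks are typically run by freezing the encoder and training a small classifier on its sentence representations. A minimal sketch under that assumption (pooling strategy, classifier shape, and class counts are illustrative, not the paper's setup):

```python
import torch
import torch.nn as nn

class ProbingClassifier(nn.Module):
    """Small MLP trained on frozen sentence embeddings to predict a probing label
    (e.g. a sentence-length bin for SeLen, or tense for Tense)."""

    def __init__(self, d_model: int, n_classes: int, d_hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, n_classes)
        )

    def forward(self, sent_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(sent_emb)   # logits over the probing-task classes

# usage sketch: mean-pool frozen encoder states into a sentence embedding
# sent_emb = encoder_states.mean(dim=1).detach()          # (batch, d_model)
# logits = ProbingClassifier(d_model=512, n_classes=6)(sent_emb)
```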
Personal Thoughts
- translation requires a complex encoding function on the source side; the strengths of attention, RNN, and CNN can complement each other to produce richer representations
- this paper shows there is a small room for improvement by letting an RNN encoder play a part alongside the Transformer encoder via the short-cut trick
Link : https://arxiv.org/pdf/1904.03092v1.pdf
Authors : Hao et al. 2019