Modeling Recurrence for Transformer

Abstract

  • propose adding an Attentive Recurrent Network (ARN) to the Transformer encoder to leverage the strengths of both attention and recurrent networks
  • experiments on WMT14 En-De and WMT17 Zh-En demonstrate the effectiveness of the approach
  • ablation study reveals that a shallow ARN connected via a short-cut bridge outperforms its deep counterpart

Details

Main Approach

  • add an additional recurrent encoder on the source side
  • the recurrent model can be (a) a simple RNN, GRU, or LSTM, or (b) an Attentive Recurrent Network (ARN), where the context representation at each step is generated by attending over the source with the previous hidden state (see the sketch below)
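
A minimal PyTorch-style sketch of one ARN recurrent step, assuming dot-product attention and a GRU cell for the state update (the exact attention form and recurrent cell are assumptions, not necessarily the paper's):

```python
import torch
import torch.nn as nn

class ARNStep(nn.Module):
    """One step of a (hypothetical) Attentive Recurrent Network:
    the previous hidden state queries the source representations via
    attention, and the resulting context drives a GRU-style update."""

    def __init__(self, d_model: int):
        super().__init__()
        self.query_proj = nn.Linear(d_model, d_model)
        self.cell = nn.GRUCell(input_size=d_model, hidden_size=d_model)

    def forward(self, prev_hidden, source):
        # prev_hidden: [batch, d_model], source: [batch, src_len, d_model]
        query = self.query_proj(prev_hidden).unsqueeze(1)   # [batch, 1, d_model]
        scores = torch.bmm(query, source.transpose(1, 2))   # [batch, 1, src_len]
        weights = torch.softmax(scores, dim=-1)
        context = torch.bmm(weights, source).squeeze(1)     # [batch, d_model]
        return self.cell(context, prev_hidden)              # new hidden state

# Unrolled for a fixed number of steps (the ablation below suggests ~8):
# hidden = source.mean(dim=1)
# for _ in range(8):
#     hidden = arn_step(hidden, source)
```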

Impact of Components

  • ablation study on the size of the additional recurrent encoder

    • a smaller (shallow) BiARN encoder attached directly to the top of the decoder outperforms all other configurations
  • ablation study on the number of recurrent steps in the ARN

    • ~8 seems optimal
  • ablation study on how to integrate the recurrent representation on the decoder side

    • stacking on top of the decoder outperformed all other integration strategies (see the sketch below)
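
A hedged sketch of the "stack on top" short-cut integration: one extra cross-attention block above a standard Transformer decoder attends to the BiARN states. The module names, residual/layer-norm arrangement, and use of nn.MultiheadAttention are illustrative assumptions:

```python
import torch.nn as nn

class StackOnTopDecoder(nn.Module):
    """Hypothetical 'stack on top' integration: the usual decoder attends
    to the self-attentive encoder output, then one extra attention block
    stacked on top attends to the (shallow) ARN states (short-cut bridge)."""

    def __init__(self, transformer_decoder: nn.Module, d_model: int, n_heads: int = 8):
        super().__init__()
        self.decoder = transformer_decoder
        self.arn_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, tgt, encoder_out, arn_states):
        # tgt: [batch, tgt_len, d_model]
        # encoder_out: [batch, src_len, d_model]  (self-attentive Transformer encoder)
        # arn_states: [batch, n_steps, d_model]   (BiARN encoder outputs)
        dec_out = self.decoder(tgt, encoder_out)
        arn_context, _ = self.arn_attn(dec_out, arn_states, arn_states)
        return self.norm(dec_out + arn_context)   # residual over the extra block
```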

Overall Result

  • with the additional ARN encoder, BLEU scores improve with statistical significance

Linguistic Analysis

  • what linguistic characteristics are the models learning? (evaluated with probing tasks; see the sketch after the list below)
    • the 1-layer BiARN performs better on all syntactic tasks and some semantic tasks
  • List of Linguistic Characteristics
    • SeLen : predict sentence length
    • WC : recover the original words given the sentence representation
    • TrDep : check whether the encoder infers the hierarchical structure of sentences (tree depth)
    • ToCo : classify sentences in terms of the sequence of top constituents
    • BShif : test whether two consecutive tokens have been inverted
    • Tense : predict the tense of the main-clause verb
    • SubN : predict the number of the main-clause subject
    • ObjN : predict the number of the direct object of the main clause
    • SoMo : check whether a sentence has been modified by replacing a random noun or verb
    • CoIn : given sentences of two coordinate clauses, detect whether the clause order has been inverted
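
A minimal sketch of a probing setup, assuming the trained encoder is frozen, its states are mean-pooled into a sentence vector, and a small classifier is fit per task; the pooling choice and classifier shape are assumptions, not the paper's exact protocol:

```python
import torch.nn as nn

class ProbingClassifier(nn.Module):
    """Hypothetical linear probe trained on frozen encoder representations
    for one probing task (e.g. SeLen, TrDep, Tense)."""

    def __init__(self, d_model: int, n_classes: int):
        super().__init__()
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, encoder_states):
        # encoder_states: [batch, src_len, d_model], computed under torch.no_grad()
        sentence_vec = encoder_states.mean(dim=1)   # simple mean pooling (assumption)
        return self.head(sentence_vec)              # logits for the probing task
```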

Personal Thoughts

  • Translation requires a complex encoding function on the source side; the strengths of attention, RNN, and CNN can complement one another to produce a richer representation
  • this paper shows that there is still a small room for improvement by letting an RNN-style encoder play a part alongside the Transformer encoder via the short-cut trick

Link : https://arxiv.org/pdf/1904.03092v1.pdf
Authors : Hao et al. 2019