Hint-based Training for Non-Autoregressive Translation

Abstract

  • propose to leverage hints from a pre-trained AutoRegressive Translation (ART) model to train a Non-AutoRegressive Translation (NART) model
    • hints from hidden state
    • hints from word alignment
  • on WMT14 EnDe, 17.8x faster inference at the cost of roughly 2 BLEU
    • NART : 25.20 BLEU / 44 ms per sentence
    • ART : 27.30 BLEU / 784 ms per sentence

Details

Introduction

  • NART models
    • fully NART models suffer from a significant loss of accuracy compared to ART models
  • To improve decoding accuracy,
    • Gu et al. 2017 introduce fertilities (borrowed from SMT) and copy source tokens to initialize decoder states
    • Lee et al. 2018 propose an iterative refinement process
    • Kaiser et al. 2018 autoregressively generate a sequence of discrete latent variables with an embedded ART model, then decode the target with a NART model
    • all of these recover accuracy at the cost of extra computation, so there is a trade-off between inference speed and translation accuracy
  • Contribution
    • improve translation accuracy by enriching the training signal with two kinds of hints from a pre-trained ART model

Motivation

  • Empirical error analysis of NART outputs leads to two findings
    • translations contain incoherent (often repetitive) phrases and miss meaningful tokens from the source side
  • incoherent phrases are visualized via cosine similarity between decoder hidden states
    • NART models trained w/o hints show higher cosine similarity between hidden states at different target positions, which correlates with repetitive outputs (a minimal similarity check is sketched after this list)
  • missing tokens are visualized via encoder-decoder attention weights
    • attention weights of NART models trained w/o hints align poorly with the correct source words, so informative tokens get dropped
  • Enhancing the loss function with these two additional signals (hidden-state cosine similarity and attention weights from the ART model) is the main contribution
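
As a rough illustration of the first diagnostic above, the snippet below computes pairwise cosine similarity between decoder hidden states at different target positions. This is a minimal PyTorch sketch, not the authors' code; the toy tensor shapes are placeholders.

```python
import torch
import torch.nn.functional as F

def position_similarity(hidden):
    """hidden: (T, d) decoder hidden states of one layer for one sentence."""
    normed = F.normalize(hidden, dim=-1)   # unit-length vector per position
    return normed @ normed.t()             # (T, T) pairwise cosine similarities

# toy example: 6 target positions, 8-dim hidden states
sim = position_similarity(torch.randn(6, 8))
print(sim.shape)  # torch.Size([6, 6]); large off-diagonal values hint at repetition
```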

Hint-based NMT


  • Hints from hidden state
    • penalize the NART model when its hidden states at different positions are similar while the corresponding ART hidden states are not (see the loss sketch after this list)
  • Hints from word alignment
    • KL-divergence loss that pushes the NART encoder-decoder attention toward the ART model's attention (also sketched after this list)
  • Initial Decoder States (z) : linear combination of source embeddings
    • exponentially decaying weights, so source tokens at closer positions receive more weight (sketched after this list)
  • Multihead Positional Attention : an additional sub-layer in the decoder that re-configures positional information
  • Inference Tricks
    • Length Prediction : instead of predicting the target length, set it to the source length plus a constant bias C estimated from the training corpus (no computational overhead)
    • Length Range Prediction : instead of committing to a single length, decode one candidate for each target length in a range around the predicted one
    • ART re-scoring : use the ART model to re-score the multiple candidates and select the final output (re-scoring is teacher-forced, so it still runs in a non-autoregressive manner); a decoding sketch follows this list
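
To make the two hint terms concrete, below is a minimal PyTorch sketch of how they could be implemented. The hinge-style thresholds (`low`, `high`), the averaging, and the way the terms enter the total loss are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pairwise_cosine(h):
    """h: (T, d) decoder hidden states of one layer -> (T, T) cosine matrix."""
    h = F.normalize(h, dim=-1)
    return h @ h.t()

def hidden_state_hint_loss(h_art, h_nart, low=0.3, high=0.7):
    """Penalize position pairs whose NART states are similar (cos > high)
    while the pre-trained ART states are not (cos < low).
    The thresholds and penalty form are assumptions for illustration."""
    s_art = pairwise_cosine(h_art).detach()     # teacher only provides hints
    s_nart = pairwise_cosine(h_nart)
    mask = ((s_art < low) & (s_nart > high)).float()
    return (s_nart * mask).sum() / mask.sum().clamp(min=1.0)

def word_alignment_hint_loss(attn_art, attn_nart, eps=1e-9):
    """KL(ART attention || NART attention) over source positions, averaged
    over heads and target positions.
    attn_*: (heads, T_tgt, T_src) encoder-decoder attention weights."""
    p = attn_art.detach().clamp(min=eps)        # teacher distribution
    q = attn_nart.clamp(min=eps)                # student distribution
    return (p * (p.log() - q.log())).sum(dim=-1).mean()

# The total training loss would combine these with the usual NART
# cross-entropy, e.g.:
#   loss = ce + lambda_hid * hidden_state_hint_loss(h_art, h_nart) \
#             + lambda_align * word_alignment_hint_loss(attn_art, attn_nart)
```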
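
The decoder-state initialization can be sketched in the same spirit. The summary only says that closer source positions receive exponentially larger weights, so the rescaling of target positions onto the source axis and the temperature `tau` below are assumptions.

```python
import torch

def init_decoder_states(src_emb, tgt_len, tau=1.0):
    """src_emb: (T_src, d) source embeddings -> (tgt_len, d) initial states z."""
    t_src = src_emb.size(0)
    src_pos = torch.arange(t_src, dtype=torch.float32)          # i = 0 .. T_src-1
    tgt_pos = torch.arange(tgt_len, dtype=torch.float32)        # j = 0 .. T_tgt-1
    # map each target position onto the source axis (assumed), then decay with distance
    centers = tgt_pos * (t_src / max(tgt_len, 1))
    dist = (src_pos.unsqueeze(0) - centers.unsqueeze(1)).abs()  # (T_tgt, T_src)
    weights = torch.softmax(-dist / tau, dim=-1)                # exponential decay
    return weights @ src_emb                                    # (T_tgt, d)

z = init_decoder_states(torch.randn(7, 16), tgt_len=9)
print(z.shape)  # torch.Size([9, 16])
```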
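
Finally, a sketch of length-range prediction plus ART re-scoring at inference time, assuming hypothetical `nart_decode(src, length)` and `art_score(src, hyp)` helpers; the default C and B values are placeholders, and a real implementation would batch the candidates.

```python
import torch

def translate(src_tokens, nart_decode, art_score, C=2, B=4):
    """Decode one candidate per target length in [T_src+C-B, T_src+C+B],
    then let the pre-trained ART model pick the best one by teacher-forced
    (hence parallel) scoring. C and B would be estimated from the training corpus."""
    t_src = len(src_tokens)
    lengths = range(max(1, t_src + C - B), t_src + C + B + 1)
    candidates = [nart_decode(src_tokens, n) for n in lengths]   # each decode is parallel
    scores = torch.tensor([art_score(src_tokens, hyp) for hyp in candidates])
    return candidates[int(scores.argmax())]
```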

Overall Performance

  • 17.8x speed-up with about a 2 BLEU loss on WMT14 EnDe (25.20 vs. 27.30 BLEU)

Personal Thoughts

  • I totally agree that all the semantics and syntax are already in the source sentence, hence NART models can work if we train them correctly
  • the Inference Tricks seem to be a strong contribution that the authors do not explicitly highlight
  • the ICLR submission was rejected, mainly for insufficient related work / story-telling, plus some bad luck

Link : https://openreview.net/pdf?id=r1gGpjActQ
Authors : Li et al 2018