Hint-based Training for Non-Autoregressive Translation
Abstract
- propose to leverage `hints` from a pre-trained AutoRegressive Translation (ART) model to train a Non-AutoRegressive Translation (NART) model
  - `hints` from hidden states
  - `hints` from word alignment
- on WMT14 EnDe, 17.8x faster inference with ~2 BLEU loss
  - NART : 25.20 BLEU / 44 ms
  - ART : 27.30 BLEU / 784 ms
Details
Introduction
- NART models
  - fully NART models suffer from loss of accuracy
  - to improve the accuracy of the decoder:
    - Gu et al 2017 introduce `fertilities` from an SMT model and copy source tokens to initialize decoder states
    - Lee et al 2018 propose an iterative refinement process
    - Kaiser et al 2018 embed an ART model that outputs discrete latent variables, then use a NART model
  - there is a trade-off between inference speed and the computational overhead of improving translation accuracy
- Contribution
  - improve translation accuracy by enriching training signals with two kinds of `hints` from a pre-trained ART model
Motivation
- Empirical error analysis of NART models leads to two findings: outputs contain incoherent phrases and miss meaningful tokens from the source side
  - incoherent phrases are visualized via cosine similarity of decoder hidden states
  - missing tokens are visualized via attention weights
- enhancing the loss function with these two signals (cosine similarity between hidden states and attention weights) is the main contribution
Hint-based NMT
Hints from hidden states
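The first hint penalizes the NART (student) decoder when its hidden states collapse into near-duplicates at position pairs where the ART (teacher) states stay distinct, measured via pairwise cosine similarity. A minimal PyTorch sketch of this idea for a single layer; the thresholds and the exact penalty form below are illustrative assumptions, not the paper's formula:

```python
import torch
import torch.nn.functional as F

def hidden_state_hint_loss(student_h, teacher_h, lo=0.3, hi=0.9):
    """Sketch of a hidden-state hint: penalize student position pairs whose
    cosine similarity is high (> hi) where the teacher's is low (< lo).
    student_h, teacher_h: (T, d) decoder hidden states for one layer.
    Thresholds lo/hi are illustrative, not the paper's exact values."""
    # pairwise cosine similarity across target positions -> (T, T)
    s_sim = F.cosine_similarity(student_h.unsqueeze(0), student_h.unsqueeze(1), dim=-1)
    t_sim = F.cosine_similarity(teacher_h.unsqueeze(0), teacher_h.unsqueeze(1), dim=-1)
    # fire only where the student collapses but the teacher does not
    mask = (t_sim < lo) & (s_sim > hi)
    return (s_sim * mask).sum() / mask.numel()
```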
Hints from word alignment
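The second hint pushes the student's attention distributions toward the teacher's. A minimal sketch assuming row-normalized attention matrices and a KL-divergence matching term (a standard choice for matching attention distributions; the averaging over positions here, and the omission of the sum over heads/layers, are simplifications):

```python
import torch

def word_alignment_hint_loss(student_attn, teacher_attn, eps=1e-9):
    """Sketch of a word-alignment hint: KL(teacher || student) between
    attention distributions, averaged over target positions.
    student_attn, teacher_attn: (T_tgt, T_src), rows sum to 1."""
    kl = teacher_attn * (torch.log(teacher_attn + eps) - torch.log(student_attn + eps))
    return kl.sum(dim=-1).mean()
```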
Initial Decoder State (z) : linear combination of source embeddings
Multi-head Positional Attention : additional sub-layer in the decoder to re-configure the positions
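A sketch of how decoder inputs might be built as a linear combination of source embeddings; the soft position-interpolation weights below are an illustrative assumption, not the paper's exact scheme:

```python
import torch

def initial_decoder_states(src_emb, tgt_len):
    """Sketch of initializing decoder inputs as a linear combination of
    source embeddings. src_emb: (T_src, d) -> returns (tgt_len, d).
    The position-based softmax weighting is illustrative only."""
    T_src, d = src_emb.shape
    src_pos = torch.arange(T_src, dtype=torch.float) / max(T_src - 1, 1)    # normalized source positions
    tgt_pos = torch.arange(tgt_len, dtype=torch.float) / max(tgt_len - 1, 1)  # normalized target positions
    # weight each source token by its proximity to the target position
    logits = -torch.abs(tgt_pos.unsqueeze(1) - src_pos.unsqueeze(0)) * T_src  # (tgt_len, T_src)
    weights = torch.softmax(logits, dim=-1)
    return weights @ src_emb
```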
Inference Tricks
Length Prediction : instead of predicting the target length, use a constant bias C obtained from the train corpus (no computational overhead)
Length Range Prediction : instead of predicting a fixed length, predict over a range of target lengths
ART re-scoring : use the ART model to re-score multiple target candidates and select the final one (re-scoring can take place in a non-autoregressive manner)
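These tricks compose naturally at inference time: decode one candidate per length in a small range around a length estimate, then let the ART teacher pick the best. Since the teacher scores a complete candidate at once, scoring runs in parallel rather than step by step. A minimal sketch; `nart_model.decode` and `art_model.score` are hypothetical interfaces, not a real library API:

```python
def decode_with_length_range(nart_model, art_model, src, pred_len, radius=2):
    """Sketch of length-range prediction + ART re-scoring.
    pred_len: estimated target length (e.g. source length + constant bias C).
    nart_model.decode / art_model.score are hypothetical interfaces."""
    # one NART candidate per target length in [pred_len - radius, pred_len + radius]
    candidates = [nart_model.decode(src, tgt_len=L)
                  for L in range(pred_len - radius, pred_len + radius + 1)]
    # ART teacher scores each full candidate in one (non-autoregressive) pass
    return max(candidates, key=lambda hyp: art_model.score(src, hyp))
```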
Overall Performance
Personal Thoughts
- I totally agree that all the semantics and syntax are in the source sentence, hence NART models can work if we train them correctly
- Inference Tricks seem to be a strong contribution that the authors do not explicitly point out
- ICLR submission rejected due to insufficient related work/story-telling and bad luck
Link : https://openreview.net/pdf?id=r1gGpjActQ
Authors : Li et al 2018