Achieving Human Parity on Automatic Chinese to English News Translation

Abstract

  • reports that the quality of Microsoft's Chinese-to-English machine translation on news sentences is at human parity
    • uses WMT2017 news translation data
    • defines how to accurately measure human parity in translation
    • describes the workflow and various experiments

Details

Defining Human Parity

  • official definition: "If there is no statistically significant difference between human quality scores for a test set of candidate translations from a machine translation system and the scores for the corresponding human translations, then the machine has achieved human parity"
  • Evaluation method
    • use direct assessment described in WMT17
    • use source-based evaluation methodology described in IWSLT17
    • annotators are shown the source text and a candidate translation and asked "How accurately does the above candidate text convey the semantics of the source text?", answering with a slider ranging from 0 to 100 (100 being perfect)
    • to identify unreliable crowd workers, direct assessment randomly includes artificially degraded translation outputs as quality-control items (a sketch of the parity test follows below)
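
Since parity is defined via statistical significance, here is a minimal sketch of such a test over per-sentence direct-assessment scores. This assumes a two-sided Wilcoxon rank-sum (Mann-Whitney U) test, as commonly used in WMT evaluations; the paper's exact test setup may differ:

```python
from scipy.stats import mannwhitneyu

def human_parity(machine_scores, human_scores, alpha=0.05):
    """Return True if the difference between the two score samples is
    not statistically significant at level alpha (i.e., parity holds)."""
    _, p_value = mannwhitneyu(machine_scores, human_scores,
                              alternative="two-sided")
    return p_value >= alpha
```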

Neural Machine Translation

  • LSTM, ConvS2S and Transformer are all state-of-the-art architectures; the Transformer is chosen as the baseline

Main Contributions

  • Main Techniques used to achieve human parity
    • Careful data selection and filtering
    • Dual Learning to utilize the duality of the translation problem
    • Iterative joint training algorithm (Zhang et al. 2018) to enhance the effect of monolingual data in both languages via back-translation
    • Deliberation Network to refine translation based on two-pass decoding
    • New KL-divergence-based training objective to encourage agreement between left-to-right and right-to-left translation
    • System Combination and Re-ranking

Data Selection and Filtering

  • Learn bilingual sentence vector representations mapped into a shared space to filter out noisy data and select relevant data (a sketch combining both filtering stages follows this list)
    • apply the method of Zoph et al. 2016 to a subset of the data known to be of good quality and relevant domain
    • use an RNN encoder-decoder similar to GNMT as the base model for representation learning
    • compute the cosine similarity between the sentence representations of source S and target T
    • remove sentence pairs with similarity below a specified threshold
  • Rule-based filtering
    • both source and target sentences must contain at least 3 and at most 70 words
    • pairs where src_len > 1.3 * tgt_len or tgt_len > 1.3 * src_len are removed (length-ratio filter)
    • sentences containing illegal characters (URLs, characters from another language) are removed
    • Chinese sentences without any Chinese characters are removed
    • duplicate sentence pairs are removed
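
A minimal sketch combining the two filtering stages. The `embed_src` / `embed_tgt` callables stand in for the learned bilingual sentence encoders, and the 0.5 threshold is an assumption (the paper does not publish its threshold); word counting via whitespace split is also a simplification for Chinese:

```python
import re
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def keep_pair(src, tgt, embed_src, embed_tgt, threshold=0.5):
    # NOTE: real Chinese word counts need segmentation; split() is a stand-in
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if not (3 <= src_len <= 70 and 3 <= tgt_len <= 70):
        return False                                  # length bounds
    if src_len > 1.3 * tgt_len or tgt_len > 1.3 * src_len:
        return False                                  # length-ratio filter
    if re.search(r"https?://", src + tgt):
        return False                                  # illegal characters (URLs)
    if not re.search(r"[\u4e00-\u9fff]", src):
        return False                                  # no Chinese characters
    return cosine(embed_src(src), embed_tgt(tgt)) >= threshold
```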

Dual Learning
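
The notes leave this section empty. As a rough placeholder, here is a sketch of the dual supervised learning (DSL) regularizer that this line of work builds on: it penalizes disagreement between the two factorizations of the joint probability P(x, y). This is an assumption-level illustration, not the paper's exact loss:

```python
def dsl_regularizer(log_p_src, log_p_tgt, log_p_tgt_given_src, log_p_src_given_tgt):
    # duality: log P(x) + log P(y|x) should equal log P(y) + log P(x|y);
    # penalize the squared gap between the two factorizations
    gap = (log_p_src + log_p_tgt_given_src) - (log_p_tgt + log_p_src_given_tgt)
    return gap ** 2
```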

Iterative Joint Training

(figure: iterative joint training workflow)
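
A minimal sketch of the iterative loop, assuming a hypothetical `train` helper that stands in for a full NMT training run and returns a model with a `.translate()` method:

```python
def iterative_joint_training(bitext, mono_zh, mono_en, train, rounds=3):
    """bitext: list of (zh, en) pairs; train: corpus -> model with .translate()."""
    zh2en = train(bitext)
    en2zh = train([(en, zh) for zh, en in bitext])
    for _ in range(rounds):
        # each direction back-translates the other side's monolingual text,
        # producing synthetic pairs that inflate the training corpus
        synth_zh_en = [(en2zh.translate(en), en) for en in mono_en]
        synth_en_zh = [(zh2en.translate(zh), zh) for zh in mono_zh]
        zh2en = train(bitext + synth_zh_en)
        en2zh = train([(en, zh) for zh, en in bitext] + synth_en_zh)
    return zh2en, en2zh
```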

Deliberation Network

(figure: deliberation network architecture)
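
A sketch of the two-pass decoding idea with hypothetical callables; in the real model the second-pass decoder attends jointly to the encoder states and the first-pass draft:

```python
def deliberation_decode(src, encoder, first_pass_decoder, second_pass_decoder):
    enc = encoder(src)
    draft = first_pass_decoder(enc)         # first pass: draft translation
    return second_pass_decoder(enc, draft)  # second pass: refine the draft
```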

L2R, R2L Agreement Regularization

  • signals from the R2L model can be leveraged to alleviate the exposure bias problem of the L2R model, and vice versa
    (figure: agreement regularization objective)
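
A minimal sketch of a KL-based agreement term, assuming the R2L distributions have already been re-aligned to L2R token order (that alignment step is elided here); the paper's exact objective may be asymmetric:

```python
import torch.nn.functional as F

def agreement_loss(l2r_log_probs, r2l_log_probs):
    # symmetric KL between the two directions' per-token output distributions
    kl_fwd = F.kl_div(l2r_log_probs, r2l_log_probs.exp(), reduction="batchmean")
    kl_bwd = F.kl_div(r2l_log_probs, l2r_log_probs.exp(), reduction="batchmean")
    return 0.5 * (kl_fwd + kl_bwd)
```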

System Combination and Re-ranking

  • combine the n-best hypotheses from all systems and train a re-ranker using k-best MIRA (a margin-based classification algorithm)
  • features used for re-ranking:
    • original system score, 5-gram LM score, R2L score, Target2Source system re-score, cross-lingual sentence similarity between source and hypothesis
  • the best-performing feature set turned out to be the original system score, LM score, R2L score, R2L sentence vector similarity, and Target2Source sentence similarity
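
k-best MIRA learns one weight per feature; at inference time the re-ranker is just a linear scorer over the combined n-best list, roughly:

```python
def rerank(nbest, weights):
    """nbest: list of (hypothesis, {feature_name: value}); weights learned by MIRA."""
    def score(features):
        return sum(weights.get(name, 0.0) * value for name, value in features.items())
    return max(nbest, key=lambda hyp_feats: score(hyp_feats[1]))
```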

NMT Pipeline

  • Train ZhEn and EnZh Transformer models using DUL and DSL (dual unsupervised/supervised learning) on the bilingual corpus (multiple models can be trained for ensembling)
  • Generate a back-translation corpus using En and Zh monolingual sentences and the pre-trained models from the previous step
  • Train a Transformer model or Deliberation Network on the inflated bilingual corpus, using the pre-trained model's weights to initialize the encoder and first-pass decoder of the Deliberation Network (see the initialization sketch below)
    (figure: NMT pipeline overview)
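
A sketch of the weight-initialization step, assuming PyTorch-style modules with hypothetical `encoder` / `first_pass_decoder` attributes:

```python
def init_deliberation(pretrained, deliberation):
    # reuse the pre-trained Transformer's encoder and decoder weights;
    # only the second-pass decoder starts from random initialization
    deliberation.encoder.load_state_dict(pretrained.encoder.state_dict())
    deliberation.first_pass_decoder.load_state_dict(pretrained.decoder.state_dict())
    return deliberation
```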

Experiments - Benchmark on WMT17

  • Data
    • WMT17 Zh-En: 18M bilingual pairs; newsdev2017 as dev set, newstest2017 as test set
    • use an LM trained on the 18M bilingual pairs to filter monolingual sentences from news.crawl and common.crawl
  • Vocab
    • Byte Pair Encoding (BPE) of Zh 44k, En 33k
  • Model
    • Transformer Big using the open-source Tensor2Tensor v1.3.0
    • 8x M40 GPUs
    • 200k training steps with Adam, learning_rate 0.3 decayed with the noam schedule
    • 5120 words per batch, checkpoints created every 60 min
    • results are reported on averaged parameters of last 20 checkpoints
    • beam=8, length_penalty=1.0
    • reported score using sacreBLEU v1.2.3
  • BLEU Score
    • the Back-Translation (BT) + Dual Learning + Deliberation Network combination performs best
    • Agreement Regularization does not add much improvement
      (figure: WMT17 benchmark BLEU results)
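
For the averaged-checkpoint evaluation mentioned above, here is a minimal sketch of checkpoint-parameter averaging, assuming PyTorch-style state dicts (Tensor2Tensor ships its own averaging utility):

```python
import torch

def average_checkpoints(paths):
    """Average parameter tensors across checkpoint files (e.g., the last 20)."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {name: tensor.clone().double() for name, tensor in state.items()}
        else:
            for name, tensor in state.items():
                avg[name] += tensor.double()
    return {name: (tensor / len(paths)).float() for name, tensor in avg.items()}
```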

Experiment on Larger Corpus

  • Data
    • WMT17 18M + 35M/50M subset selected from 100M UN corpus
  • Vocab : same
  • Model
    • Transformer Big with 8,192 hidden units in the first feed-forward (conv) block (larger than the original Transformer Big)
    • 300K training steps with Adam
    • minibatch of 3,500 with 8 GPUs
    • same beam, length_penalty and averaging param
  • BLEU Score
    • Base8k (the larger model) performs better
    • the additional corpus selected via sentence vectors (SentVect) improves the BLEU score the most
      (figure: BLEU results on the larger corpus)

Human Evaluation Results

  • Ensembles (Combo-4, 5, 6) achieve human parity (scores statistically equivalent to Reference-HT)
    • Reference-HT: human translations produced without using online translation engines
    • Reference-PE: human post-edited output based on Google Translate results
    • Reference-WMT: the original newstest2017 references released after WMT17
    • Online-A-1710: Microsoft Translator output collected in Oct 2017
    • Online-B-1710: Google Translate output collected in Oct 2017
      (figures: human evaluation score tables)

Evaluation Campaigns

  • Motivated by the need to resolve issues with human evaluation processes
    • Annotator variability: would the same annotator give different results on the same data? Addressed by running three campaigns on the same evaluation data, which showed near-complete overlap
    • Data variability: evaluations are conducted on completely different subsets of the test data, although the test data itself may already be biased

Human Analysis

  • preliminary human error analysis of the best system
    (figure: human error analysis table)

Personal Thoughts

  • Complete NMT workflow, from data selection up to human evaluation and error analysis
  • Impressed by the extensive experiments, though the methods are biased toward Microsoft's own ideas
  • A human evaluation comparing the baseline and improved models would have been interesting, to see how the error types were reduced

Link : https://www.microsoft.com/en-us/research/uploads/prod/2018/03/final-achieving-human.pdf
Authors : Hassan et al. 2018