Code and data for the paper "Character-level Chinese-English Translation through ASCII Encoding" (WMT 2018).
If you use this code or data, please cite:

```bibtex
@InProceedings{en2wubi,
  author    = {Nikolov, Nikola and Hu, Yuhuang and Tan, Mi Xue and Hahnloser, Richard H.R.},
  title     = {Character-level {Chinese-English} Translation through {ASCII} Encoding},
  booktitle = {Proceedings of the Third Conference on Machine Translation},
  month     = {October},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},
  pages     = {10--16},
  url       = {http://www.aclweb.org/anthology/W18-6302}
}
```
The data used to produce the results in the paper is available here.
To convert your data from Chinese to Wubi, follow the instructions in the en2wubi package.
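As an illustrative sketch only (the module path and flags below are hypothetical; the package's actual interface is documented in its README):

```bash
# Hypothetical invocation: convert a Chinese text file to its ASCII Wubi
# encoding. The real entry point and options may differ; see the en2wubi
# package documentation.
python -m en2wubi --input corpus.cn --output corpus.wb
```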
Follow the instructions in the fairseq library for preprocessing, training, and evaluation. To train the same LSTM model that we use in the paper, pass `--arch lstm` to `train.py`; for the FConv model, pass `--arch fconv_iwslt_de_en`.
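A minimal sketch of that pipeline, assuming the older fairseq-py script interface (`preprocess.py`, `train.py`, `generate.py`) and placeholder paths; `wb` stands for the Wubi side of the corpus:

```bash
# Binarize the parallel corpus (placeholder paths and language codes).
python preprocess.py --source-lang en --target-lang wb \
    --trainpref data/train --validpref data/valid --testpref data/test \
    --destdir data-bin/en2wb

# Train the LSTM model used in the paper; swap in --arch fconv_iwslt_de_en
# for the convolutional model.
python train.py data-bin/en2wb --arch lstm --save-dir checkpoints/en2wb

# Translate the test set with the best checkpoint.
python generate.py data-bin/en2wb --path checkpoints/en2wb/checkpoint_best.pt
```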
At the subword level, you additionally need to learn and apply subword segmentation rules to the dataset. We use the subword-nmt library for this step.
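For example, with the subword-nmt command-line tools (the number of merge operations below is a placeholder, not the value used in the paper):

```bash
# Learn BPE merge operations on the training side of the corpus.
subword-nmt learn-bpe -s 10000 < data/train.en > codes.en

# Apply the learned segmentation to every split.
subword-nmt apply-bpe -c codes.en < data/train.en > data/train.bpe.en
subword-nmt apply-bpe -c codes.en < data/valid.en > data/valid.bpe.en
subword-nmt apply-bpe -c codes.en < data/test.en  > data/test.bpe.en
```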
Follow the instructions in this repository for preprocessing, and train a bilingual char2char model using `char2char/train_bi_char2char.py`.
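As a minimal sketch, the training entry point is invoked directly; dataset locations and hyperparameters are configured through the script's own options, described in the char2char repository:

```bash
# Train a bilingual character-level model; set data paths and
# hyperparameters as described in the char2char repository.
python char2char/train_bi_char2char.py
```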
To compute BLEU, download and run `multi-bleu.perl` as:

```bash
perl multi-bleu.perl reference.txt < model_output.txt
```
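`multi-bleu.perl` ships with the Moses decoder; one way to fetch it:

```bash
# Fetch multi-bleu.perl from the Moses decoder repository on GitHub.
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl
```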
When evaluating en2wb against en2cn, you can use our scripts to convert the Chinese outputs to Wubi before computing BLEU, which makes the scores more comparable.
Nikola I. Nikolov and Yuhuang Hu