Code and data for the paper "Character-level Chinese-English Translation through ASCII Encoding" (WMT 2018).
If you use this code or data, please cite:

```bibtex
@InProceedings{en2wubi,
  author    = {Nikolov, Nikola and Hu, Yuhuang and Tan, Mi Xue and Hahnloser, Richard H.R.},
  title     = {Character-level {Chinese-English} Translation through {ASCII} Encoding},
  booktitle = {Proceedings of the Third Conference on Machine Translation},
  month     = {October},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},
  pages     = {10--16},
  url       = {http://www.aclweb.org/anthology/W18-6302}
}
```
The data used to produce the results in the paper is available here.
To convert your data from Chinese to Wubi, follow the instructions in the en2wubi package.
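As an illustrative sketch only (the module path and flags below are hypothetical; the package's actual interface is documented in its README):

```bash
# Hypothetical invocation: convert a Chinese text file to its ASCII Wubi
# encoding. The real entry point and options may differ; see the en2wubi
# package documentation.
python -m en2wubi --input corpus.cn --output corpus.wb
```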
Follow the instructions in the fairseq library for preprocessing, training, and evaluation. To train the same LSTM model that we use in the paper, pass `--arch lstm` to `train.py`; for the FConv model, pass `--arch fconv_iwslt_de_en`.
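A minimal sketch of that pipeline, assuming the older fairseq-py script interface (`preprocess.py`, `train.py`, `generate.py`) and placeholder paths; `wb` stands for the Wubi side of the corpus:

```bash
# Binarize the parallel corpus (placeholder paths and language codes).
python preprocess.py --source-lang en --target-lang wb \
    --trainpref data/train --validpref data/valid --testpref data/test \
    --destdir data-bin/en2wb

# Train the LSTM model used in the paper; swap in --arch fconv_iwslt_de_en
# for the convolutional model.
python train.py data-bin/en2wb --arch lstm --save-dir checkpoints/en2wb

# Translate the test set with the best checkpoint.
python generate.py data-bin/en2wb --path checkpoints/en2wb/checkpoint_best.pt
```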
At the subword level, you additionally need to learn and apply subword segmentation rules to the dataset. We use the subword-nmt library for this step.
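For example, with the subword-nmt command-line tools (the number of merge operations below is a placeholder, not the value used in the paper):

```bash
# Learn BPE merge operations on the training side of the corpus.
subword-nmt learn-bpe -s 10000 < data/train.en > codes.en

# Apply the learned segmentation to every split.
subword-nmt apply-bpe -c codes.en < data/train.en > data/train.bpe.en
subword-nmt apply-bpe -c codes.en < data/valid.en > data/valid.bpe.en
subword-nmt apply-bpe -c codes.en < data/test.en  > data/test.bpe.en
```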
Follow the instructions in this repository for preprocessing, and train a bilingual char2char model using `char2char/train_bi_char2char.py`.
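As a minimal sketch, the training entry point is invoked directly; dataset locations and hyperparameters are configured through the script's own options, described in the char2char repository:

```bash
# Train a bilingual character-level model; set data paths and
# hyperparameters as described in the char2char repository.
python char2char/train_bi_char2char.py
```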
To compute BLEU, download and run `multi-bleu.perl` as:

```bash
perl multi-bleu.perl reference.txt < model_output.txt
```
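`multi-bleu.perl` ships with the Moses decoder; one way to fetch it:

```bash
# Fetch multi-bleu.perl from the Moses decoder repository on GitHub.
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl
```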
When evaluating en2wb against en2cn, you can use our scripts to convert the Chinese outputs to Wubi before computing BLEU, which makes the scores more comparable.
Nikola I. Nikolov and Yuhuang Hu