Code and pre-trained models for the Chinese-to-English machine translation benchmark.
First, clone this repository and its submodules:
git clone https://github.com/nusnlp/c2e-mt-benchmark.git
cd c2e-mt-benchmark
git submodule update --init --recursive
Second, go to each subdirectory under tools/ and follow its setup/installation instructions.
Finally, download and unpack the pre-trained models into the models/ subdirectory:
cd models/
wget http://sterling8.d2.comp.nus.edu.sg/~christian/c2e-mt-benchmark/pretrained.tar.gz
tar -xvzf pretrained.tar.gz
cd ..
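A quick check that the archive was actually unpacked can save a confusing failure later. The following is only a sketch: `models_unpacked` is a hypothetical helper, not part of the repository, and because the layout inside the archive is not described here, it only tests that models/ holds something other than the downloaded archive.

```shell
# Hypothetical helper (not part of the repository).
# Succeeds if the given directory (default: models) exists and contains
# at least one entry other than the downloaded archive itself.
models_unpacked() {
    local dir="${1:-models}"
    [ -d "$dir" ] || return 1
    # List top-level entries, ignoring pretrained.tar.gz; keep the first one.
    local first
    first="$(find "$dir" -mindepth 1 -maxdepth 1 ! -name 'pretrained.tar.gz' | head -n 1)"
    [ -n "$first" ]
}

# Usage, from the repository root:
#   models_unpacked models || echo "models/ looks empty; re-run the tar step" >&2
```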
The input is a plain text file containing Chinese sentences, one sentence per line. The input file is passed through the following pipeline:
- Chinese word segmentation, by running
scripts/segment.sh < input > input.seg
- Translation (ensure that the Theano flags are set as environment variables; replace nist with unpc for models trained on the UN Parallel Corpus)
  - without re-ranking:
    scripts/translate-norerank.sh nist input.seg output [device(s)]
    where device(s) can be "gpu0", "gpu0 gpu1", or the default "cpu"
  - with re-ranking:
    scripts/translate-rerank.sh nist input.seg output [device(s)]
- Recasing, by running
scripts/recase.sh < output > output.rc
- Detokenization, by running
perl scripts/detokenizer.perl -l en < output.rc > output.detok
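The four steps above can be chained in a small wrapper. This is a minimal sketch, not a script shipped with the repository: the function names (`c2e_filter`, `c2e_run`, `c2e_translate`) and the `DRY_RUN` variable are assumptions introduced for illustration. It assumes it is run from the repository root and that the scripts accept exactly the arguments shown above. Device arguments are passed through unchanged, so the caller's quoting decides whether "gpu0 gpu1" arrives as one argument or two.

```shell
# Hypothetical wrapper around the pipeline steps (not part of the repository).
# With DRY_RUN=1 the commands are printed instead of executed.

# Run a command with stdin from $1 and stdout to $2.
c2e_filter() {
    local in="$1" out="$2"
    shift 2
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$* < $in > $out"
    else
        "$@" < "$in" > "$out"
    fi
}

# Run a command without redirection.
c2e_run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: c2e_translate MODEL MODE INPUT OUTPUT [DEVICE...]
#   MODEL: nist or unpc;  MODE: norerank or rerank
# Writes INPUT.seg, OUTPUT, OUTPUT.rc and OUTPUT.detok.
c2e_translate() {
    if [ "$#" -lt 4 ]; then
        echo "usage: c2e_translate MODEL MODE INPUT OUTPUT [DEVICE...]" >&2
        return 2
    fi
    local model="$1" mode="$2" input="$3" output="$4"
    shift 4   # remaining arguments, if any, are the device(s)

    case "$model" in
        nist|unpc) ;;
        *) echo "MODEL must be nist or unpc" >&2; return 2 ;;
    esac
    case "$mode" in
        norerank|rerank) ;;
        *) echo "MODE must be norerank or rerank" >&2; return 2 ;;
    esac

    # 1. Chinese word segmentation
    c2e_filter "$input" "$input.seg" scripts/segment.sh || return 1
    # 2. Translation, with or without re-ranking
    c2e_run "scripts/translate-$mode.sh" "$model" "$input.seg" "$output" "$@" || return 1
    # 3. Recasing
    c2e_filter "$output" "$output.rc" scripts/recase.sh || return 1
    # 4. Detokenization
    c2e_filter "$output.rc" "$output.detok" perl scripts/detokenizer.perl -l en
}

# Example: preview the commands for the NIST model on one GPU.
#   DRY_RUN=1; c2e_translate nist norerank input output gpu0
```

Theano reads its configuration from the `THEANO_FLAGS` environment variable. Which flags the translation scripts require is not stated here, so the sketch leaves setting them to the caller.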
The outputs/ subdirectory contains the translation outputs produced by our models.
A comparison of the BLEU scores on the NIST test sets achieved by our model and by prior published work is available here.
If you use the pre-trained models and settings from this repository, please cite the following paper:
Hadiwinoto, Christian and Ng, Hwee Tou (2018). Upping the ante: Towards a better benchmark for Chinese-to-English machine translation. In Proceedings of the 11th edition of the Language Resources and Evaluation Conference. (pp. 16--23). Miyazaki, Japan.