This repository contains scripts for reproducing the results in:
Lukas Edman, Antonio Toral, and Gertjan van Noord. 2020. Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution. The 22nd Annual Conference of the European Association for Machine Translation (EAMT 2020).
The following tools are required:
- Python 3
- PyTorch (tested on 1.2)
- Moses
- fastBPE
- UDPipe (tested on 1.2, with models from 2.4)
- StanfordNLP parser
- Dependency-based word2vec
- fastText
- VecMap
- UnsupervisedMT
The scripts here assume these tools are saved (or soft-linked) in a `tools` directory at the same level as the `scripts` directory, with the following subdirectories:
tools/moses/
tools/fastBPE/
tools/udpipe/
tools/word2vecf/
tools/fastText/
tools/vecmap/
tools/unmt/ # points to UnsupervisedMT/NMT/
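For example, this layout can be set up with soft links; every `/path/to/...` below is a placeholder for wherever you have cloned or built the corresponding tool, not an actual location:

```bash
# Hypothetical setup: link existing tool checkouts/builds into tools/.
# All source paths are placeholders; adjust them to your installation.
mkdir -p tools
ln -s /path/to/mosesdecoder        tools/moses
ln -s /path/to/fastBPE             tools/fastBPE
ln -s /path/to/udpipe              tools/udpipe
ln -s /path/to/word2vecf           tools/word2vecf
ln -s /path/to/fastText            tools/fastText
ln -s /path/to/vecmap              tools/vecmap
ln -s /path/to/UnsupervisedMT/NMT  tools/unmt
```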
From the `scripts` directory, run:
./pipeline.sh normal 1M
This will run all preprocessing steps, from downloading the data up to mapping the embeddings with VecMap. The `1M` argument specifies using 1 million sentences per language.
To train the NMT system, run:
./nmt_system ../data/mono/1M.en ../data/mono/1M.de toku.true.bpe_60000 1M test_run
This will train an NMT system on 1 million sentences per language, using the pretrained embeddings from the previous step.
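For reference, here is the same command with the argument roles spelled out as comments; these interpretations are inferred from the example and the preprocessing steps above, and are assumptions rather than documented behavior:

```bash
# Inferred argument roles (assumptions, not documented behavior):
#   1. ../data/mono/1M.en    monolingual data, source language
#   2. ../data/mono/1M.de    monolingual data, target language
#   3. toku.true.bpe_60000   suffix of the preprocessed files
#                            (tokenized, truecased, BPE with 60,000 merges)
#   4. 1M                    sentence count, matching the preprocessing run
#   5. test_run              a name for this experiment
./nmt_system ../data/mono/1M.en ../data/mono/1M.de toku.true.bpe_60000 1M test_run
```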
- Depending on your system, some steps of the preprocessing pipeline may need to be run individually. This is especially the case for dependency parsing, where we recommend splitting the data into 10,000-sentence chunks and parsing the chunks in parallel to reduce the overall parsing time (a sketch follows this list).
- To evaluate BLI precision at 5 and 10, replace `eval_translation.py` from VecMap with our modified version included in `tools/vecmap/`, and use the flag `--p_at 10` (see the example after this list).
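For the chunked dependency parsing suggested in the first note, a minimal sketch is given below, assuming English data and UDPipe 1.2 with a UD 2.4 model; the input file, chunk prefix, binary location, and model file name are all placeholders:

```bash
# Split the corpus into 10,000-sentence chunks and parse each chunk
# in the background. All paths and the model file are placeholders.
split -l 10000 -d -a 3 ../data/mono/1M.en chunk.
for f in chunk.*; do
    ../tools/udpipe/udpipe --tokenize --tag --parse \
        english-ewt-ud-2.4.udpipe "$f" > "$f.conllu" &
done
wait
# Reassemble the parses in the original order.
cat chunk.*.conllu > 1M.en.conllu
```

Launching every chunk at once can oversubscribe a single machine; a job scheduler or `xargs -P` can be used instead to cap the number of concurrent parsers.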
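For the BLI evaluation in the second note, an example invocation of the modified script might look as follows; the mapped-embedding and test-dictionary file names are placeholders:

```bash
# Evaluate BLI precision at k=10 with the modified eval_translation.py
# (the --p_at flag is added by the modified version in tools/vecmap/).
python3 ../tools/vecmap/eval_translation.py \
    src.mapped.emb trg.mapped.emb \
    -d en-de.test.dict \
    --p_at 10
```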