This directory contains our source codes and dataset for "Data Augmentation Using Back-translation for Context-aware Neural Machine Translation" presented in DiscoMT 2019.
2019/12/15: Performed a large refactering
- python 3.6.8
- tensorflow 1.12.0
- numpy 1.16.0
- sentencepiece
- jq
- mecab
README.md # This readme
activate
discourse_test_set/
corpus_preparation/
scripts/ # language-pair-independent scripts
corpus_preparation/ # Original corpus are downloaded here
experiments/
l1-l2/ # Template of experiment on a directed language pair
global_config.json # Config for this lang-pair
data/
scripts/ # lang-pair-specific scripts
backward/ # The back-translation model
forward/ # Forward-translation models
en-ja # Instance of l1-l2 (same below)
ja-en
en-fr
fr-en
./scripts/
transformer/ # Transformer scripts
mosesdecoder/
preprocess-scripts/
corpus_formatters/ # scripts to extract sentences from raw corpus
project/ # Scripts used in the experiments
all.sh # Main shell script run from experiments/l1-l2
./experiments/l1-l2/
data/
raw/ # Sentences extracted from the original corpora
preprocessed/ # Preprocessed corpora
concat/ # Concatenated preprocessed corpora
back_translated/ # Back-translated monolingual corpus
backward/
model_config.json # Config of the Transformer for back-trans
model_config.py
log/ # Log of train and test of this Transformer
forward/
1-to-1/ # Sentence-level models
0/ # Transformer trained without data-augmentation
500/ # trained with 500k pseudo data
1000k/ # trained with 1000k pseudo data
...
2-to-1/
2-to-2/
source ./activate
This sets the global root's path into an env var $CMTBT_GROOT. You have to do this every time you open a new terminal window.
./corpus_preparation/download_data.sh
By this command, the following corpora are downloaded and sentences are extracted (like removing xml tags and gathering sentences in multiple files into a single file)
- IWSLT2017 en-ja and en-fr
- Europarl v7
- fr sentences are gathered into a single file
- Japanese diet corpus
- Bookcorpus
- Download is done using https://github.com/soskek/bookcorpus
- This would take much time
Note: Blank lines are inserted at the document boundaries
Procedure to conduct training and evaluation of back-translation and data-augmented forward translation models.
By default, environments for 4 language pairs are prepared:
- en->ja : ./experiments/en-ja
- ja->en : ./experiments/ja-en
- en->fr : ./experiments/en-fr
- fr->en : ./experiments/fr-en
You can make your own by copying
- ./experiments/l1-l2/global_config.json
- ./experiments/l1-l2/scripts/preprocess.sh
- ./experiments/l1-l2/scripts/copy_dataset.sh
into the new directory and modify them as you like.
Move to the root directory for experiment of a language pair (ja-en in the following examples)
cd ./experiments/ja-en
global_config.json is placed in every language pair's root directory. You can edit it to modify settings like vocabulary size, variation in pseudo data size etc.
Note: by default, batch capacity (maximum number of tokens in a batch) is 16384, which is compatible with GPUs with 24GB RAM in total (e.g. two Titan X GPUs).
pwd
#output: /path/to/global_root/experiments/ja-en
../../scripts/project/all.sh 1
IWSLT2017 en-ja and Japanese diet corpus are copied from /global_root/corpus_preparation/
../../scripts/project/all.sh 2
../../scripts/project/all.sh 3
../../scripts/project/all.sh 4
../../scripts/project/all.sh 11 12
../../scripts/project/all.sh 13
../../scripts/project/all.sh 21
../../scripts/project/all.sh 22 23