"Exploring Diversity in Back-Translation for Low-Resource Machine Translation" (Burchell et al., NAACL 2022)
The scripts in this repo were written to run on CSD3 using the Ampere GPU nodes. Note that all data and models are saved to `/rds/user/$USER/hpc-work`. All scripts are in the `scripts/` directory.
## Requirements

- Miniconda (Python >= 3.8)
- Weights and Biases (you will need an API key)
## Installation

- Create the conda environment: `conda env create -f diversity-env.yml`
- Activate the environment: `conda activate diversity-env`
- Configure your Weights and Biases API key: `export WANDB_API_KEY=<your_key>`
- Get the Stanford Parser: `bash installation/get-stanford-parser.sh`
- Start an interactive session with one GPU and activate the environment again
- Install fairseq and sentencepiece: `bash installation/install-fairseq-spm.sh`
- Close the GPU session
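For reference, the whole setup can be scripted. This is a minimal sketch of the steps above; requesting the interactive GPU session is left as a comment because the exact command depends on your CSD3 project and allocation:

```bash
# One-shot setup sketch, run from the repo root on CSD3.
conda env create -f diversity-env.yml
conda activate diversity-env
export WANDB_API_KEY=<your_key>            # your Weights and Biases key
bash installation/get-stanford-parser.sh

# fairseq and sentencepiece need a GPU to build: start an interactive
# session with one GPU here (command depends on your allocation), then:
#   conda activate diversity-env
#   bash installation/install-fairseq-spm.sh
# and close the GPU session afterwards.
```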
## Training baseline ensembles

- Run `bash prepare-parallel-data.sh` to download and prepare the parallel training data, development sets, and test sets.
- Alter the project name in all `cluster-scripts/submit-*` scripts to your own project (the `#SBATCH -A` line); see the sketch after this list for one way to do this in bulk.
- Run `bash submit-parallel-only-ensembles.sh` to train four ensembles of four models, one for each language pair and direction.
- Test ensemble performance with `bash test-ensemble.sh DICT_DIR MODEL{1..4}`, e.g. `bash test-ensemble.sh ../datasets/parallel-data/tr-en/enc-tren/ ../models/tren/tren-parallel-only/tren-parallel-only-*/checkpoint_best.pt`
- Results are in each ensemble's directory, e.g. `../models/tren/tren-parallel-only/results`
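Rather than editing each submission script by hand, you can swap the Slurm account in bulk. A minimal sketch, where `MYPROJECT` is a placeholder for your own CSD3 project name:

```bash
# Point every submission script's #SBATCH -A line at your own project.
# MYPROJECT is a placeholder; substitute your actual Slurm account.
sed -i 's/^#SBATCH -A .*/#SBATCH -A MYPROJECT/' cluster-scripts/submit-*
```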
## Syntax groups

- Run `bash split-and-submit-parse-parallel.sh {tr|is}` to shard the English side of the parallel datasets and submit them to the cluster for parsing.
- The notebook `CreateSyntaxGroups.ipynb` contains instructions to create the three syntax groups.
- Prepare the syntax groups for fairseq training by running `bash prepare-syntax-group-data.sh DATA_DIR SRC_DATA TRG_DATA group{0,1,2}` (see the sketch after this list).
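As a concrete illustration of that last call, the sketch below fills in hypothetical tr-en paths; the data directory and file names are assumptions, so substitute wherever your grouped data actually lives. Note that `group{0,1,2}` brace-expands to the three group names as separate arguments.

```bash
# Hypothetical invocation for tr-en; DATA_DIR and the source/target file
# names are assumptions about your layout.
bash prepare-syntax-group-data.sh ../datasets/syntax-groups/tren \
    train.tren.tr train.tren.en group{0,1,2}
```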
## Back-translation

- Run `bash prepare-monolingual-data.sh` to download and prepare the monolingual data.
- For each language pair, back-translate the monolingual data by running `bash submit-backtranslation-jobs.sh DATA DICT_DIR MODEL{1..4}`, e.g. `bash submit-backtranslation-jobs.sh ../datasets/monolingual-data/en/sp/mono.9M.isen-sp.en.gz ../datasets/parallel-data/is-en/enc-enis/ ../models/enis/enis-parallel-only/enis-parallel-only-*/checkpoint_best.pt`
- For syntax models, run:

  ```bash
  for i in {0..3}; do
    for j in {0..2}; do
      for l in isen tren; do
        sbatch cluster-scripts/submit-translate-mono.slurm \
          ../datasets/back-translation/$l/mono.shard.${l:0:2}$i \
          ../datasets/parallel-data/${l:0:2}-en/enc-$l/ \
          ../models/$l/$l-syntax-finetune/$l-syntax-finetune-*group$j/checkpoint_best.pt \
          syntax$j
      done
    done
  done
  ```
- Once back-translation is complete, create the parallel-plus-back-translation datasets for each language pair by running `bash prepare-all-back-translated-data FINAL_SRC FINAL_TRG` (see the sketch after this list).
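A minimal sketch of that last step, assuming `FINAL_SRC` and `FINAL_TRG` are the combined source- and target-side output files (an assumption; check the script header for the exact argument semantics), using hypothetical tr-en beam paths:

```bash
# Hypothetical invocation for tr-en with beam back-translations; both
# output paths are assumptions about the expected arguments and layout.
bash prepare-all-back-translated-data \
    ../datasets/parallel-plus-bt/tren/beam3M/train.tr \
    ../datasets/parallel-plus-bt/tren/beam3M/train.en
```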
## Training on parallel plus back-translated data

- Run `bash submit-parallel-plus-bt-ensembles.sh` to train ensembles of four models for each of the language pairs and different diversity datasets.
- Test ensemble performance as before with `bash test-ensemble.sh DICT_DIR MODEL{1..4}`, e.g. `bash test-ensemble.sh ../datasets/parallel-plus-bt/tren/nucleus3M/enc-tren/ ../models/tren/tren-parallel-plus-bt-nucleus3M/tren-parallel-plus-bt-nucleus3M-*/checkpoint_best.pt`
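To test every variant without retyping paths, a loop along these lines works; only `beam3M` and `nucleus3M` appear in this README, so the dataset list is an assumption to extend with whichever diversity datasets you trained:

```bash
# Hypothetical sweep over the tr-en parallel-plus-bt ensembles.
for d in beam3M nucleus3M; do
  bash test-ensemble.sh \
      ../datasets/parallel-plus-bt/tren/$d/enc-tren/ \
      ../models/tren/tren-parallel-plus-bt-$d/tren-parallel-plus-bt-$d-*/checkpoint_best.pt
done
```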
## Diversity metrics

- Run `bash calculate-diversity-metrics.sh INPUT_DATA LANG_FROM_LANG CORPUS_SIZE TYPE` to calculate the diversity metrics over the back-translated corpora, e.g. `bash calculate-diversity-metrics.sh ../datasets/parallel-plus-bt/entr/mono.beam-tr-bt.en.gz en-from-tr 9000000 beam3M`
- Note that results will be saved in `diversity-metrics/${LANG_FROM_LANG}/`
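The same command can be looped over several decoding variants; the nucleus file name below just follows the beam example's naming pattern and is an assumption:

```bash
# Hypothetical sweep for English back-translated from Turkish; the
# nucleus corpus file name is an assumption based on the beam example.
for t in beam nucleus; do
  bash calculate-diversity-metrics.sh \
      ../datasets/parallel-plus-bt/entr/mono.$t-tr-bt.en.gz \
      en-from-tr 9000000 ${t}3M
done
```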