Going Back and Forth: A Study into Back Translation for Data Augmentation in Neural Machine Translation
The following repository contains the code and data used in the paper "Going Back and Forth in Neural Machine Translation", written in submission for the course "Natural Language Processing 2" at the University of Amsterdam.
We investigate the use of back translation as a data augmentation technique for neural machine translation. We take a closer look at the effects of back translation on the quality of the generated data and the performance of the model.
- Does BT perform better for low-resource domain adaptation than fine-tuning on a limited parallel corpus? Furthermore, are they complementary?
- How does data selection improve backtranslation and MT performance? In particular, we aim to enhance the quality of synthetic training data by using a diversity metric. This metric filters the synthetic source-side data to select the most diverse and informative samples, based on type-token ratio. This approach seeks to balance the trade-off between data quality and quantity, aiming to reduce noise and redundancy in the training data and thereby improve overall machine translation performance.
- How can backtranslation be optimised or extended for this low-resource scenario? Iterative backtranslation has been shown to an effective approch in order to boost model performance \cite{hoang2018iterative}. We explore its effects and provide an additional baseline.
- Can we utilize current SOTA NMT systems in order to adapt less intense past NMT models? We hypothesize that introducing such teacher-student relationship to the backtranslation pipeline can improve the performance of older systems and make them more viable within target applications.
The code is structured as follows:
.
|-- Data # Contains the data used in the project
|-- scripts # Contains the main scripts for data generation and training using fairseq
|-- utils # Contains the main model and data processing utilities
|-- tests # Contains generated translations from the test set and more
├── notebooks
│ ├── data_filtering.ipynb # Data filtering and selection notebook
│ ├── notebook.ipynb # Main working notebook
│ └── results.ipynb # Results and analysis notebook
|-- train.py # Main training script
|-- generate.py # Data generation script
|-- requirements.txt # Required packages
|-- calculate_bleu.py # BLEU and COMET score calculation utility
Most of the available data is stored in the Data
directory. For reproductions, we also provide sample generated machine translations in the tests
directory.
- Python 3.x
- See
requirements.txt
for required Python packages
- Clone the repository
- Install the required packages with
pip install -r requirements.txt
- Run
python train.py
to train the model; additional arguments can be found in the script by runningpython train.py --help
- Run
python generate.py
to generate new data; additional arguments can be found as above