Machine translation automatically translates text between natural languages, and Neural Machine Translation (NMT) has gained attention for its ability to model context and produce fluent translations. However, low-resource languages, which lack resources such as supervised training data, remain a challenge for Natural Language Processing (NLP). We combine active learning with the NMT toolkit Joey NMT to reach sufficient accuracy and robust predictions for low-resource language translation. Active learning is a semi-supervised machine learning strategy in which the training algorithm uses query techniques to decide which unlabeled samples would be most beneficial to label. We implemented two model-driven acquisition functions for selecting the samples to be validated. This work uses four transformer-based NMT systems for translating English to Hindi: a baseline model (NMT-1), a fully trained model (NMT-2), an active learning least-confidence-based model (NMT-3), and an active learning margin-sampling-based model (NMT-4). The Bilingual Evaluation Understudy (BLEU) metric is used to evaluate system results. The BLEU scores of the NMT-1, NMT-2, NMT-3 and NMT-4 systems are 21, 22, 23 and 24, respectively. The findings demonstrate that active learning techniques improve the quality of the translation system.
Joey NMT was initially developed and is maintained by Jasmijn Bastings (University of Amsterdam) and Julia Kreutzer (Heidelberg University), now both at Google Research. Mayumi Ohta at Heidelberg University is continuing the legacy.
Active Learning for Neural Machine Translation implements the following features (aka the minimalist toolkit of NMT 🔧):
- Transformer Encoder-Decoder
- BPE tokenization
- BLEU, PPL evaluation
- Beam search with length penalty and greedy decoding
- Human-in-the-loop or non-interactive query mechanism
- Random Strategy
- Margin Query Strategy
- Least Confidence Strategy
- Customizable initialization for Active Learning Dataset
- Learning curve plotting for BLEU and PPL
- Scoring hypotheses and references
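
The margin and least-confidence strategies above are model-driven acquisition functions: they rank unlabeled source sentences by how uncertain the current model is about its own output, and the most uncertain sentences are queried for labels. A minimal sketch of the two scores, assuming you can obtain per-step softmax distributions for a hypothesis (the function names and the `probs` layout are illustrative, not the toolkit's API):

```python
import numpy as np

def least_confidence(probs: np.ndarray) -> float:
    """1 minus the probability of the most likely token, averaged over
    the sequence. Higher values mean the model is less confident."""
    # probs: array of shape (seq_len, vocab_size) with softmax outputs
    return float(np.mean(1.0 - probs.max(axis=-1)))

def margin_score(probs: np.ndarray) -> float:
    """Negative gap between the two most likely tokens at each step,
    averaged over the sequence. A small gap means an uncertain model."""
    top2 = np.sort(probs, axis=-1)[:, -2:]          # two largest per step
    return float(-np.mean(top2[:, 1] - top2[:, 0]))

def select_queries(pool_probs, acquisition=least_confidence, k=100):
    """Return the indices of the k most uncertain pool sentences."""
    scores = np.array([acquisition(p) for p in pool_probs])
    return np.argsort(-scores)[:k]
```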
Active Learning for Neural Machine Translation is built on JoeyNMT and PyTorch. Please make sure you have a compatible environment. We tested Joey NMT 2.0 with
- python 3.9
- torch 1.12.1
- cuda 11.3
⚠️ Warning When running on GPU you need to manually install the PyTorch version suitable for your CUDA version. For example, you can install PyTorch 1.12.1 with CUDA 11.3 as follows:
$ pip install --upgrade torch==1.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
You can run the code from source.
Clone this repository:
$ git clone https://github.com/kritisingh24/joeynmt.git
$ cd joeynmt
[Optional] For fp16 training, install NVIDIA's apex library:
$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
- Install conda from Anaconda
- To build a reproducible environment, run the following:
$ conda create --name test39 python=3.9
$ conda activate test39
$ conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
$ pip install -r requirement.txt
Active Learning for Neural Machine Translation has 4 modes: `train`, `test`, `translate` and `active_learning`. All of them take a YAML-style config file as argument. You can find examples in the `configs` directory.
Most importantly, the configuration contains the description of the model architecture (e.g. the number of hidden units in the encoder), paths to the training, development and test data, and the training hyperparameters (learning rate, validation frequency, etc.).
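As a rough illustration of that layout, here is what a parsed config could look like; the section names (`data`, `training`, `model`, `testing`) are the ones this README refers to, while the individual keys and values are placeholders, so consult the files in `configs/` for the actual schema.

```python
# Hypothetical, stripped-down view of a YAML config after parsing.
config = {
    "name": "en_hi_baseline",
    "data": {                      # paths to training, development and test data
        "train": "data/train",
        "dev": "data/dev",
        "test": "data/test",
    },
    "model": {                     # architecture description
        "encoder": {"hidden_size": 512},
        "decoder": {"hidden_size": 512},
    },
    "training": {                  # training hyperparameters
        "learning_rate": 0.0002,
        "validation_freq": 1000,   # how often to validate
        "model_dir": "models/en_hi_baseline",
        "overwrite": False,        # keep False so model_dir is never clobbered
    },
    "testing": {                   # decoding / evaluation options
        "beam_size": 5,
    },
}
```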
📝 Info Note that subword model training and joint vocabulary creation are not included in the modes above and have to be done separately. We provide a script that takes care of it: `scripts/build_vocab.py`.
$ python scripts/build_vocab.py configs/small.yaml --joint
To train with active learning, run
$ python main.py active_learning configs/baseline.yaml
This will train a baseline model on the training data, keep aside a portion of the data as the active learning pool, validate on the validation data, and store model parameters, vocabularies, and validation outputs. All needed information should be specified in the `data`, `training` and `model` sections of the config file (here `configs/baseline.yaml`).
model_dir/
├── *.ckpt # checkpoints
├── *.hyps # translated texts at validation
├── config.yaml # config file
├── spm.model # sentencepiece model / subword-nmt codes file
├── src_vocab.txt # src vocab
├── trg_vocab.txt # trg vocab
├── train.log # train log
└── validation.txt # validation scores
💡 Tip Be careful not to overwrite `model_dir`; set `overwrite: False` in the config file.
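
Conceptually, the `active_learning` mode wraps ordinary training in an outer query loop: train on the labeled data, score the held-aside pool with an acquisition function, label the most uncertain sentences, and retrain. A rough sketch of that loop, with every helper passed in as a callable because none of these names are part of the toolkit's API:

```python
def active_learning_loop(labeled, pool, acquisition, train_model,
                         hypothesis_probs, get_label, rounds=5, k=100):
    """Grow the labeled set with the pool sentences the current model is
    least sure about, retraining after every query round.

    labeled          -- list of (src, trg) pairs for supervised training
    pool             -- list of src sentences held aside for active learning
    acquisition      -- e.g. least_confidence or margin_score from above
    train_model      -- callable: labeled pairs -> trained model
    hypothesis_probs -- callable: (model, src) -> per-step softmax array
    get_label        -- callable: src -> reference (human-in-the-loop or existing labels)
    """
    model = train_model(labeled)                       # baseline model
    for _ in range(rounds):
        scores = [acquisition(hypothesis_probs(model, s)) for s in pool]
        ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
        queried = [pool[i] for i in ranked[:k]]        # most uncertain sentences
        labeled = labeled + [(s, get_label(s)) for s in queried]
        keep = set(ranked[k:])
        pool = [s for i, s in enumerate(pool) if i in keep]
        model = train_model(labeled)                   # retrain on the grown set
    return model
```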
For standard training (without active learning), run
$ python main.py train configs/fully_trained.yaml
This will train a model on the training data, validate on the validation data, and store model parameters, vocabularies, and validation outputs. All needed information should be specified in the `data`, `training` and `model` sections of the config file (here `configs/fully_trained.yaml`).
The `test` mode will generate translations for the validation and test sets (as specified in the configuration) in `model_dir/out.[dev|test]`.
$ python -m joeynmt test configs/small.yaml --ckpt model_dir/avg.ckpt
If `--ckpt` is not specified above, the checkpoint path in `load_model` of the config file or the best model in `model_dir` will be used to generate translations.
You can specify e.g. sacrebleu options in the `test` section of the config file.
💡 Tip `scripts/average_checkpoints.py` will generate averaged checkpoints for you.
$ python scripts/average_checkpoints.py configs/small.yaml --joint
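
Under the hood, checkpoint averaging is just an element-wise mean of the saved weights. A minimal sketch with plain PyTorch, assuming the checkpoints store their weights under a `model_state` key (the paths in the example are placeholders):

```python
import torch

def average_checkpoints(ckpt_paths, out_path):
    """Write a checkpoint whose weights are the mean of several checkpoints."""
    avg = None
    for path in ckpt_paths:
        # assumption: the weights live under "model_state" in the saved dict
        state = torch.load(path, map_location="cpu")["model_state"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    avg = {k: v / len(ckpt_paths) for k, v in avg.items()}
    torch.save({"model_state": avg}, out_path)

# e.g. average_checkpoints(["model_dir/10000.ckpt", "model_dir/20000.ckpt"],
#                          "model_dir/avg.ckpt")
```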
If you want to output the log-probabilities of the hypotheses or references, you can specify `return_score: 'hyp'` or `return_score: 'ref'` in the testing section of the config, and run `test` with the `--output_path` and `--save_scores` options.
$ python -m joeynmt test configs/small.yaml --ckpt model_dir/avg.ckpt --output_path model_dir/pred --save_scores
This will generate `model_dir/pred.{dev|test}.{scores|tokens}`, which contain the scores and the corresponding tokens.
📝 Info
- If you set `return_score: 'hyp'` with greedy decoding, then token-wise scores will be returned. Beam search will return sequence-level scores, because the scores are summed up per sequence during beam exploration.
- If you set `return_score: 'ref'`, the model looks up the probabilities of the given ground-truth tokens, and both decoding and evaluation will be skipped.
- If you specify `n_best` > 1 in the config, the first translation in the n-best list will be used in the evaluation.
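
The sequence-level score mentioned above is simply the sum of the token-level log-probabilities of the emitted tokens, so the two views are consistent. A toy illustration with made-up numbers:

```python
import math

# Toy per-token log-probabilities for one hypothesis, as greedy decoding
# with return_score: 'hyp' would expose them token by token.
token_logprobs = [-0.2, -1.1, -0.05, -0.7]

# Beam search reports only the per-sequence sum of these values.
sequence_score = sum(token_logprobs)
print(sequence_score)                 # -2.05
print(math.exp(sequence_score))       # corresponding sequence probability
```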
The `translate` mode accepts input from stdin and generates translations.
- File translation
$ python -m joeynmt translate configs/small.yaml < my_input.txt > output.txt
- Interactive translation
$ python -m joeynmt translate configs/small.yaml
You'll be prompted to type an input sentence. Joey NMT will then translate with the model specified in `--ckpt` or the config file.
💡 Tip Interactive `translate` mode doesn't work with multi-GPU. Please run it on a single GPU or CPU.
📝 Info For interactive translate mode, you should specify `pretokenizer: "moses"` in both the src's and trg's `tokenizer_cfg`, so that you can input a raw sentence. Then `MosesTokenizer` and `MosesDetokenizer` will be applied internally. For test mode, we used the preprocessed texts as input and set `pretokenizer: "none"` in the config.
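
For orientation, the relevant corner of the `data` section could look roughly like this once parsed; only `tokenizer_cfg` and `pretokenizer` come from the note above, the surrounding keys are placeholders:

```python
# Illustrative only -- check the files in configs/ for the exact schema.
data_cfg = {
    "src": {
        "lang": "en",
        "tokenizer_cfg": {"pretokenizer": "moses"},   # raw input, MosesTokenizer applied
    },
    "trg": {
        "lang": "hi",
        "tokenizer_cfg": {"pretokenizer": "moses"},   # output run through MosesDetokenizer
    },
}
# For test mode on preprocessed text, set "pretokenizer": "none" instead.
```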
Pre-processing was done with the Moses decoder tools, as in this script.
The processed dataset is available on GDrive. It contains all the configurations and vocabularies with which the model results below were obtained.
| Model | Architecture | Tokenization | BLEU (dev) | BLEU (test) | #Params | Download |
|---|---|---|---|---|---|---|
| baseline | Transformer | subword-nmt | 16.55 | 18.26 | 19M | enhi_transformer_t2_baseline.zip (217MB) |
| fully_trained | Transformer | subword-nmt | 23.44 | 22.56 | 19M | enhi_transformer_t2_fully_trained.zip (219MB) |
| margin | Transformer | subword-nmt | 24.25 | 23.20 | 19M | enhi_transformer_t2_margin.tar.gz (216MB) |
| least_confidence | Transformer | subword-nmt | 24.11 | 23.36 | 19M | enhi_transformer_t2_least_confidence.tar.gz (215MB) |