This is an academic machine translation toolkit with a focus on readability and ease of modification. We have also tried to make the algorithms as fast as possible; please let us know if you have any suggestions or concerns regarding the toolkit. To get familiar with what you can do and how to do it, please read through this documentation.
To run the code you will need Python 3.8+ (tested on v3.8.2) and PyTorch 1.6+.
This library contains the implementation for our EACL 2021 paper "Better Neural Machine Translation by Extracting Linguistic Information from BERT". For updated aspect-augmented translation results and for acquiring aspect vectors, please see further down in this document.
We start with the project structure. At the highest level of the project, there are four main directories:
- `resources`: All the necessary resources for the project are expected to be loaded from the `SFUTranslate/resources` directory. Please refrain from putting the resources anywhere else if you are planning to make a pull request.
- `translate`: All the source code files are under the `SFUTranslate/translate` directory. If you are using an IDE to debug or run the code, don't forget to mark this directory as your sources directory. Otherwise, you will need to have the address of the `SFUTranslate/translate` directory in your `$PATH` environment variable to be able to run the code. Another way of running the code is `cd /path/to/SFUTranslate/translate && python trainer.py <path/to/config.yml>`. To test the trained model on the test set(s), you may use the `test_trained_model` script (e.g. `cd /path/to/SFUTranslate/translate && python test_trained_model.py <path/to/config.yml>`). The last way of using the toolkit is to run the standalone experiment scripts in the `resources/exp-scripts` directory. Please note that these scripts start by downloading and setting up an independent python environment, and they don't need your interaction while running.
- `tests`: All the source code for the unit test cases of the different modules of the toolkit. If you are using an IDE to debug or run the code, don't forget to mark this directory as your test sources directory.
- `docs`: Contains documentation-related files.
The next sections will help you get more familiar with the code flow and training different models.
As stated earlier, the source code files are in the `translate` directory (i.e. `/path/to/SFUTranslate/translate/`).
The following figure depicts the general structure of the modules.
As of the current version, the `translate` package contains the following sub-packages and classes.
- `models`
  - `general`: providing a general wireframe for a new `NMTModel` instance.
  - `aspects`: the package containing the implementations of the aspect-augmented NMT model and its baselines.
    - `ae_utils`, `aspect_extract_main`, `containers`, `extract_vocab`, `model`, `module`, `tester`, `trainer`
  - `sts`: the package containing the implementation of the attentional sequence-to-sequence model using RNNs.
    - `model`
  - `copy`: the package containing the implementation of a simple copy model which only tokenizes and then detokenizes the input to produce the output.
    - `model`
  - `transformer`: the package containing the implementation of the vanilla transformer model.
    - `model`, `modules`, `optim`, `torch_model`, `utils`
- `readers`: the data read/preprocess methods and classes are placed in this package.
  - `datasets`
    - `generic`, `dataset`
  - `data_provider`, `iterators`, `sequence_alignment`, `tokenizers`, `utils`
- `scripts`: the package containing all the dangling scripts in the toolkit.
  - `create_pretrained_tokenizer_vocabulary`
  - `extract_common_vocab`
  - `mteval-v14.pl` (its required dependencies are placed beside it in the `Sort` directory)
  - `wrap-xml.perl`
- `utils`: the package containing the utility functions used in the toolkit.
  - `containers`, `evaluation`, `init_nn`, `optimizers`
- `configuration`: the class in charge of reading the config file and providing its content inside an easily accessible configuration object (`cfg`).
- `test_trained_model`: the test script which loads an already trained model (pointed to by the config file) and runs beam search on it.
- `trainer`: the main script which loads the config file, creates the model, and runs the training process. You may want to look at this script first to get familiar with what can be done using this toolkit.
Here we present a complete schema for the config file, containing all possible valid tags that can be put in it.
Please note that you may add or remove tags in your config file; however, if it lacks a configuration
that is essential to your task, you will face an error indicating that the required configuration value is not present.
In that case, please look at the schema below and add the missing configuration tag with your desired value.
An example config file called `nmt.yml` is already placed in the `resources` directory. You can modify and use it for running the project.
Alternatively, you can create your own config file as a text file with a `.yml` extension and put your configurations in it.
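For illustration only, the toolkit itself reads the config through its `translate.configuration` module, but the following minimal `pyyaml` sketch shows how a config file can be loaded and a few of the documented tags inspected; the file path is just an example and the key access assumes a flat layout like the schema listing below:

```python
import yaml

# Example path; adjust it to wherever your config file lives.
with open("resources/nmt.yml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

# Access a handful of the documented tags (assuming a flat key layout as in the schema below).
print(cfg.get("src_lang"), "->", cfg.get("tgt_lang"))
print("model:", cfg.get("model_name"))
print("train batch size (tokens):", cfg.get("train_batch_size"))
```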
Here is the configuration schema:
debug_mode: [true/false] if true, the tokenizer is deactivated and the length filter is also applied to the validation and test sets
src_lang: the two-letter language identifier of the source language
tgt_lang: the two-letter language identifier of the target language
dataset_name: the name of the torchtext dataset [currently supported: "multi30k16", "iwslt17", "wmt19_de_en", "wmt19_de_fr"]
lowercase_data: [true/false] whether the dataset sentences need to be lowercased or not
src_tokenizer: the tokenizer to be used for the source side of parallel data [possible values: "generic"|"moses"|"pre_trained"|"spacy"|"bert"]
tgt_tokenizer: the tokenizer to be used for the target side of parallel data [possible values: "generic"|"moses"|"pre_trained"|"spacy"|"bert"]
pad_token: special pad token used in data transformation
bos_token: special begin of sentence token used in data transformation
eos_token: special end of sentence token used in data transformation
unk_token: special unk token used in data transformation
propn_token: special proper noun indicator used in data transformation [currently this is deactivated]
max_sequence_length: the maximum training sentence length to be considered
max_vocab_src: maximum size of the source-side vocabulary
min_freq_src: minimum frequency for source vocabulary entries; words less frequent than this are replaced with ``unk_token``
max_vocab_tgt: maximum size of the target-side vocabulary
min_freq_tgt: minimum frequency for target vocabulary entries; words less frequent than this are replaced with ``unk_token``
extract_unk_stats: the flag which enables the code to perform type/token analysis on the ratio of <UNK> tokens in the current vocabulary settings
share_vocabulary: the flag which enables merging the source and target vocabulary into a single object, assigning unique ids to the same token in both source and target space
sentence_count_limit: the maximum number of sentences to be considered from the training set. It will normally be used when different data fractions are intended to be compared. Please note that the actual number of processed sentences can be lower than this number since empty lines are removed from the training data.
aspect_vectors_data_address: the address of the pre-trained aspect vector extractors. You don't need to set any value for it if your model is not "aspect_augmented_transformer".
model_name: `sts` or `transformer` [also "aspect_augmented_transformer", "multi_head_aspect_augmented_transformer", "syntax_infused_transformer", and "bert_freeze_input_transformer"]
train_batch_size: the average number of words expected to be put in a batch while training [4000 to 5000 seems to be a reasonable default]
valid_batch_size: the average number of words expected to be put in a batch while testing [you don't need big numbers in this case]
maximum_decoding_length: maximum valid decoding length
emb_dropout: the dropout ratio applied to both source and target embedding layers of the sts model
encoder_emb_size: the size of the embedding layer in the encoder of the sts model
encoder_hidden_size: the size of the RNN layer in the encoder of the sts model
encoder_layers: the number of encoder RNN layers in the sts model
encoder_dropout_rate: the dropout value applied to the encoder RNN in the sts model
decoder_emb_size: the size of the embedding layer in the decoder of the sts model
decoder_hidden_size: the size of the RNN layer in the decoder of the sts model; if this is not equal to twice the size of the encoder output, a bridge network automatically maps the encoder hidden states to the decoder hidden states
decoder_layers: the number of decoder RNN layers in the sts model
decoder_dropout_rate: the dropout value applied to the decoder RNN in the sts model
out_dropout: the dropout value applied to the output of attention before it is fed to the affine transformation in the sts model
coverage_dropout: the dropout value applied to the encoded representations in the sts model before the phi parameter is created
bahdanau_attention: [true/false] if not true, Luong general attention is applied (in the sts model)
coverage_phi_n: the N value for calculating the phi parameter in the linguistic coverage calculation (in the sts model)
coverage_required: [true/false] indicates whether the coverage mechanism will be considered or not (in the sts model)
coverage_lambda: the coefficient with which the coverage loss is weighted when it is added to the NLL loss (in the sts model)
transformer_d_model: the size of the encoder and decoder layers of the transformer model
transformer_h: the number of the transformer model heads
transformer_dropout: the dropout applied to the transformer model layers
transformer_d_ff: the size of the feed-forward layer in the transformer model
transformer_max_len: the maximum expected input length for positional encoding in the transformer model [you can safely leave it at 5000 if you don't do document translation]
transformer_N: the number of encoder and decoder layers in the transformer model
transformer_loss_smoothing: the smoothing factor in the KL divergence loss calculation
transformer_opt_factor: the NoamOpt scheduler learning rate multiplication factor
transformer_opt_warmup: the number of NoamOpt scheduler warmup steps
share_all_embeddings: the flag indicating the embeddings are required to be shared in encoder/decoder modules
n_epochs: the number of iterations over all of the training data
init_optim: the optimizer with which the model is initialized [normally for just a few iterations]
init_learning_rate: the initial learning rate of init_optim
init_epochs: the number of iterations that ``init_optim`` looks at the training data to initialize the parameters
optim: the optimizer with which the initialized model is trained
learning_rate: the initial learning rate of ``optim``
learning_momentum: the momentum used in ``optim`` if it is ``SGD``
grad_clip: [true/false] indicates whether gradient clipping needs to be applied to the gradients
max_grad_norm: maximum valid gradient norm, the values greater than this value will be clipped
val_slices: number of expected validations in one epoch
lr_decay_patience_steps: number of patience validation slices after which the learning rate scheduler will decay the learning rate
lr_decay_factor: the learning rate decay factor which is being used by the learning rate scheduler
lr_decay_threshold: the improvement threshold in validation bleu score below which a decay indicator signal is sent to the learning rate scheduler
lr_decay_min: min value of learning rate which the decayed learning rate might approach
update_freq: a non-negative integer indicating the number of batches whose gradients are accumulated before calling optimizer.step() [for more details see https://arxiv.org/pdf/1806.00187.pdf]
beam_size: the size of each bucket considered in beam search
beam_search_length_norm_factor: the length normalization factor in beam search based on Google's NMT system paper [https://arxiv.org/pdf/1609.08144.pdf]
beam_search_coverage_penalty_factor: the coverage penalty factor in beam search based on Google's NMT system paper [https://arxiv.org/pdf/1609.08144.pdf]
checkpoint_name: the name of the checkpoint which is being saved/loaded
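The `beam_search_length_norm_factor` and `beam_search_coverage_penalty_factor` tags above follow the length and coverage penalties of Google's NMT system paper cited in the schema (Wu et al., 2016). As a reference point, here is a minimal sketch of that paper's length penalty; the toolkit's own beam search may organize this differently:

```python
def gnmt_length_penalty(hyp_length: int, alpha: float) -> float:
    # Length penalty lp(Y) = ((5 + |Y|) ** alpha) / ((5 + 1) ** alpha) from the GNMT paper.
    # With alpha = 0 the penalty is 1.0, i.e. plain log-probability scoring.
    return ((5.0 + hyp_length) ** alpha) / ((5.0 + 1.0) ** alpha)


# During beam search, a hypothesis' score is its accumulated log-probability
# divided by this penalty, so longer candidates are not unfairly penalized.
log_prob = -4.2  # example accumulated log-probability of a candidate
score = log_prob / gnmt_length_penalty(hyp_length=12, alpha=0.6)
print(score)
```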
In this section, we report the results of our Aspect Integrated NMT model along with the replicated baseline models.
The results here are the same as what was reported in the paper.
You can run our experiments using the standalone scripts in `resources/exp-scripts/aspect_exps`.
The source code for Aspect Augmented NMT along with the replicated baselines is implemented in `translate/models/aspects`.
In the result tables:
- #param represents the number of trainable parameters (size of BERT model parameters [110.5M] has not been added to the model size for the aspect augmented and bert-freeze models since BERT is not trained in these settings).
- runtime is the total time the training script has run, including the time taken for creating the model, reading the data, and iterating over the instances for all the epochs.
- We have used a single GeForce GTX 1080 GPU for M30k experiments and a single Titan RTX GPU for IWSLT and WMT experiments.
- The following results are produced by lowercasing, translating, and then truecasing. The cased aspect-augmented experiment results are also available; although we have not reported them in the paper, you can find them in the last table.
- All the experiment results are calculated using the `mteval-v14.pl` script.
Model | val | test2016 | #param | runtime |
---|---|---|---|---|
Vaswani et al. 2017 | 39.63 | 38.25 | 9.5 M | 84 min |
Sundararaman et al. 2019 | 40.03 | 38.32 | 13.9 M | 514 min |
Clinchant et al. 2019 (bert freeze) | 40.07 | 39.73 | 9.1 M | 99 min |
Aspect Augmented +M30k asp. vectors | 40.47 | 40.19 | 10.1 M | 104 min |
Aspect Augmented +WMT asp. vectors | 38.72 | 41.53 | 10.1 M | 102 min |
Model | val | test2016 | #param | runtime |
---|---|---|---|---|
Vaswani et al. 2017 | 31.07 | 30.29 | 9.4 M | 93 min |
Sundararaman et al. 2019 | 32.55 | 32.71 | 13.6 M | 504 min |
Clinchant et al. 2019 (bert freeze) | 33.83 | 33.15 | 9.0 M | 104 min |
Aspect Augmented +M30k asp. vectors | 34.45 | 34.42 | 9.9 M | 108 min |
Aspect Augmented +WMT asp. vectors | 34.73 | 34.28 | 9.9 M | 118 min |
Model | dev2010 | tst2010 | tst2011 | tst2012 | tst2013 | tst2014 | tst2015 | #param | runtime |
---|---|---|---|---|---|---|---|---|---|
Vaswani et al. 2017 | 27.69 | 27.93 | 31.88 | 28.15 | 29.59 | 25.66 | 26.76 | 18.4 M | 172 min |
Sundararaman et al. 2019 | 29.53 | 29.67 | 33.11 | 29.42 | 30.89 | 27.09 | 27.78 | 28.9 M | 1418 min |
Clinchant et al. 2019 (bert freeze) | 30.31 | 30.00 | 34.20 | 30.04 | 31.26 | 27.50 | 27.88 | 18.0 M | 212 min |
Aspect Augmented +IWSLT asp. vectors | 29.03 | 29.17 | 33.42 | 29.58 | 30.63 | 26.86 | 27.83 | 18.9 M | 214 min |
Aspect Augmented +WMT asp. vectors | 31.22 | 30.82 | 34.79 | 30.29 | 32.34 | 27.71 | 28.40 | 18.9 M | 211 min |
Model | wmt_val | newstest2014 | newstest2015 | newstest2016 | newstest2017 | newstest2018 | newstest2019 | #param | runtime |
---|---|---|---|---|---|---|---|---|---|
Vaswani et al. 2017 | 28.96 | 26.91 | 26.91 | 31.42 | 28.07 | 33.56 | 29.77 | 68.7 M | 35 h |
Sundararaman et al. 2019 | 28.56 | 27.80 | 26.93 | 30.44 | 28.63 | 33.87 | 30.48 | 93.8 M | 258 h |
Clinchant et al. 2019 (bert freeze) | 28.63 | 27.54 | 27.15 | 31.69 | 28.30 | 33.89 | 31.48 | 69.1 M | 33 h |
Aspect Augmented +WMT asp. vectors | 28.98 | 28.05 | 27.58 | 32.29 | 29.07 | 34.74 | 31.48 | 70.3 M | 46 h |
As we suggested earlier, we have also trained aspect-augmented translation models using cased data; however, the results were not ready in time to be included in the paper. This section reports the cased aspect-augmented translation results.
To be explicit, we have used the tag `lowercase_data: false` in the config files when training both the aspect vectors and the
translation models, and this is now the default in all configuration files under `exp-configs/aspect_exps`.
We are also releasing only this set of German aspect vectors (trained with cased data) to be used in other tasks.
To replicate our experiments, please click on the dataset name in the first column of each table and download the linked aspect extractor module
(the pre-trained aspect extractor module will be used in the `models.aspects.model.AspectAugmentedTransformer` class, and
you can take a look at it for a better understanding).
Once downloaded, put the pre-trained aspect extractor module in the `.checkpoints` directory (if it doesn't exist, create it) beside the `translate` and `resources` directories.
Now point the `aspect_vectors_data_address` tag in the config file of your desired experiment to the aspect extractor module
(e.g. `aspect_vectors_data_address: ../.checkpoints/aspect_extractor.de`) and run the trainer with the modified config file.
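As an optional sanity check before training, you can try opening the downloaded file with PyTorch. This is only a sketch and assumes the aspect extractor is a standard `torch.save` artifact; the exact structure of its contents is defined by the toolkit's `models.aspects` code:

```python
import torch

# Example path: adjust to wherever you placed the downloaded aspect extractor.
ckpt_path = "../.checkpoints/aspect_extractor.de"

# Load on CPU just to inspect the saved object; the trainer handles the real loading itself.
obj = torch.load(ckpt_path, map_location="cpu")
print(type(obj))

# If it is a dict-like checkpoint, listing its keys gives a rough idea of what is inside.
if isinstance(obj, dict):
    print(list(obj.keys()))
```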
We retrained the aspect extractors using cased data. First, let's look at the cased classification F-1 scores on the same aspect tag set (at the word level).
Dataset | CPOS | FPOS | WSH | #Tokens |
---|---|---|---|---|
M30k | 98.87 | 98.21 | 99.34 | 12822 |
IWSLT | 97.79 | 96.68 | 98.71 | 16760 |
WMT | 94.59 | 94.28 | 88.74 | 50666 |
Also, let's look at the sub-word level classification F-1 scores.
Dataset | CPOS | FPOS | WSH | SWP | #Tokens |
---|---|---|---|---|---|
M30k | 98.13 | 97.34 | 99.44 | 99.86 | 15836 |
IWSLT | 96.78 | 95.47 | 98.87 | 99.70 | 19559 |
WMT | 91.96 | 91.49 | 89.03 | 97.70 | 63259 |
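For reference, token-level scores of this kind can be computed with `scikit-learn` (already a project dependency). The following is only a small sketch with made-up gold and predicted coarse POS tags, not the toolkit's own evaluation code:

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical gold and predicted CPOS tags for a handful of (sub-)word tokens.
gold = ["NOUN", "VERB", "DET", "NOUN", "ADJ", "NOUN"]
pred = ["NOUN", "VERB", "DET", "NOUN", "NOUN", "NOUN"]

# Micro-averaged F-1 over all tokens, in the same spirit as the scores reported above.
print(f1_score(gold, pred, average="micro"))

# A per-label breakdown of precision, recall, and F-1.
print(classification_report(gold, pred))
```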
Now, we report both cased Bleu and NIST scores from `mteval-v14.pl` for the cased aspect-augmented NMT experiments.
German to English | val | test_2016_flickr | test_2017_flickr | test_2018_flickr | test_2017_mscoco |
---|---|---|---|---|---|
Using M30k aspect vectors | 41.82 / 7.88 | 41.48 / 7.98 | 38.07 / 7.33 | 33.40 / 6.97 | 30.50 / 6.21 |
German to English | dev2010 | tst2010 | tst2011 | tst2012 | tst2013 | tst2014 | tst2015 |
---|---|---|---|---|---|---|---|
Using IWSLT aspect vectors | 32.84 / 7.53 | 32.14 / 7.45 | 35.88 / 7.87 | 31.33 / 7.41 | 33.83 / 7.51 | 29.14 / 7.01 | 30.30 / 7.04 |
German to English | wmt_val | newstest2014 | newstest2015 | newstest2016 | newstest2017 | newstest2018 | newstest2019 | newstest2020 | newstestB2020 |
---|---|---|---|---|---|---|---|---|---|
Using WMT aspect vectors | 29.74 / 7.19 | 29.58 / 7.69 | 29.56 / 7.55 | 34.38 / 8.33 | 30.83 / 7.84 | 36.75 / 8.73 | 31.51 / 7.62 | 18.70 / 4.14 | 18.21 / 4.17 |
In this section, we list the experiment configurations of different models on different datasets.
You can pick the pre-trained models whose results are posted here from the Pretrained Models hyperlinks
(the models are currently removed as they are quite outdated!).
To use them, simply put them in a directory named `.checkpoints`
created beside the `translate` package (if it is a zipped file, unzip it there) and point the `checkpoint_name`
in the configuration file to the downloaded pre-trained model.
Each experiment will have a model file (ending in ".pt") in there with the exact same name mentioned in the table below.
The dataset with which the model has been trained will be automatically downloaded.
The configuration file with which the model was configured can be downloaded by clicking on the experiment name link (first column of the table).
Experiment Name | Replication Script | Model | Task | Dataset | Devset/Testset | Language | Bleu Score (dev/test) | Model File | More Info |
---|---|---|---|---|---|---|---|---|---|
seq2seq_multi30k_de_en | replicate.sh | Seq2Seq | Translation | Multi30k2016 | multi30k/val; multi30k/test2016 | German2English | - | - | |
transformer_multi30k_de_en | replicate.sh | Transformer | Translation | Multi30k2016 | multi30k/val; multi30k/test2016 | German2English | - | - | |
seq2seq_iwslt_de_en | replicate.sh | Seq2Seq | Translation | IWSLT2017 | dev2010; tst201[0-5] | German2English | - | - | |
transformer_iwslt_de_en | replicate.sh | Transformer | Translation | IWSLT2017 | dev2010; tst201[0-5] | German2English | - | - | |
seq2seq_wmt_de_en | replicate.sh | Seq2Seq | Translation | WMT19 | valid (1% of train data); newstest201[4-9] | German2English | - | - | |
transformer_wmt_de_en | replicate.sh | Transformer | Translation | WMT19 | valid (1% of train data); newstest201[4-9] | German2English | - | - |
The second column of the results table above contains the replication script for each experiment
(in case of multiple test sets, the script only tests on the default test set provided by `torchtext`).
Replication of our results is quite easy: just download the `replicate.sh` script for the experiment you are interested in and run `bash <script_name>`.
The script will create a virtual environment, install SFUTranslate along with all of its dependencies, train the model with our provided configuration, and test and print the dev/test scores.
You can also modify these scripts as you desire.
The validation results during training, as well as the samples created from the model in each epoch, can be found in `train.output`
during and after training (it might take a while before they actually appear in the file while training is in progress, as Python buffers the content and writes it in batches).
The final validation and test results can be found in `test.output`. You may also track the training progress by running the command `cd /path/to/SFUTranslate/translate && tail -f train_progress_bars.log`.
The output files of our experiments are also uploaded beside each replication script, in a folder with the same name, so you can compare and verify your replicated results.
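If you want to score your replicated outputs against the references yourself, the `sacrebleu` dependency offers a quick programmatic check; note that the numbers in this document come from `mteval-v14.pl`, so small differences are expected. A minimal sketch with placeholder file names:

```python
import sacrebleu

# Placeholder file names: one detokenized hypothesis and one reference per line.
with open("hypotheses.txt", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("references.txt", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

# corpus_bleu takes the hypothesis list and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(bleu.score)
```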
In this section, we go over the required libraries and how the project depends on each, so that in case you need
to change (or remove) any of them, you know how and where to look.
The descriptions here essentially describe the content of the `requirements.txt` file that sits beside this Readme document.
- `pyyaml`: the library needed to read the configurations, parse them, and access their parsed values. You may want to look at the `translate.configuration` class for the use case of this library.
- `pytorch`: the backend library which provides the neural network related functionality and classes.
- `torchtext`: the data provider which downloads and loads the requested dataset.
- `sacrebleu`: the evaluation package used for computing the Bleu scores on the validation and test sets.
- `sacremoses`: the tokenizer/detokenizer implementation of Moses.
- `spaCy`: the pre-processing toolkit used for normalization and tokenization of English, German, Spanish, Portuguese, French, Italian, and Dutch. However, to enable the library to process each of these languages, you will need to download its resources for spaCy using the following lines (simply copy a download line and paste it into the command line to execute it). You don't have to use spaCy, as `readers.tokenizers` provides a number of different tokenizers to choose from (a short tokenization check with spaCy follows this list).
python -m spacy download en
python -m spacy download de
python -m spacy download es
python -m spacy download pt
python -m spacy download fr
python -m spacy download it
python -m spacy download nl
- `transformers`: the library providing the pre-trained BERT and BERT tokenizer models.
- `unidecode`: the library used for transliteration of unicode text into ASCII (in the `readers.sequence_alignment` script).
- `tokenizers`: the tokenizer library providing state-of-the-art word-piece tokenization and pre-training models for both sub-word and word-piece.
- `textblob`: the library used for sentiment analysis feature extraction in `models.aspects.extract_vocab`.
- `nltk`: the provider of the Lesk algorithm implementation for word-sense feature extraction in `models.aspects.extract_vocab`.
- `scikit-learn`: the classification metrics/report provider library for the aspect extractor.
- `numpy`: the library used to create temporary tensors on the CPU before copying them to torch and the GPU.
- Utility libraries [`tqdm` and `xml`]: the libraries that provide simple utility functionality.
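As a quick check that the spaCy resources mentioned above are installed correctly, the following sketch tokenizes a short German sentence; it assumes the spaCy 2.x shorthand model names installed by the download commands above:

```python
import spacy

# "de" is the shortcut linked by `python -m spacy download de` in spaCy 2.x.
nlp = spacy.load("de")

doc = nlp("Ein Mann fährt mit dem Fahrrad durch die Stadt.")
print([token.text for token in doc])
```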
If you need help regarding the toolkit or you want to discuss your comments, you are more than welcome to email Hassan S. Shavarani.