A modified re-implementation of a paper published at ICLR 2020, used to translate English to Dutch.
- Data from the translation dictionary is cleaned and converted to a DataFrame.
- Text is converted to Unicode, unnecessary whitespace is removed, and each word in the dictionary is assigned an integer ID.
- The class 'BasicTokenizer' tokenizes on whitespace and punctuation and lower-cases the text. It also supports tokenization of CJK characters.
- The class 'WordpieceTokenizer' takes the tokens produced by BasicTokenizer as input and applies WordPiece tokenization.
- The class 'FullTokenizer' combines both tokenizers into a complete tokenization pipeline and provides functions to map tokens to IDs and back (see the usage sketch after this list).
- Positional encoding vectors are computed and added to the embedding vectors (see the sketch after this list).
- A padding mask (to ignore pad tokens) and a look-ahead mask (to keep the decoder from attending to future positions) are used (see the sketch after this list).
- Multi-head attention based on scaled dot-product attention is used in the model layers, followed by a feed-forward network (see the sketch after this list).
- The Encoder layers are built using pre-trained BERT layers.
- The Decoder layers are built from a custom architecture of masked multi-head attention and feed-forward sublayers, which are connected via layer normalization.
- Multiple Decoder layers are stacked to build the Decoder.
- In the multi-head attention (with padding mask) sublayer, V (value) and K (key) receive the encoder output as input, while Q (query) receives the output of the masked multi-head attention sublayer.
- A transformer consists of the encoder, the decoder, and a final linear layer. The decoder output is the input to the linear layer, whose output is returned.
- During training, the Adam optimizer is used with a custom learning rate schedule following the formula in the 'Attention Is All You Need' paper (see the sketch after this list).
- Checkpointing is also implemented during training, and teacher forcing is used to train the model (see the training-step sketch after this list).
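
A minimal usage sketch of the tokenization pipeline described above, assuming the classes expose the same interface as Google's original BERT `tokenization.py` (the module name and vocab path below are assumptions):

```python
from tokenization import FullTokenizer  # module name assumed; adjust to this repo's layout

# vocab.txt is a placeholder path for the BERT vocabulary in use.
tokenizer = FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

tokens = tokenizer.tokenize("The weather is nice today.")  # BasicTokenizer + WordpieceTokenizer
ids = tokenizer.convert_tokens_to_ids(tokens)              # tokens -> integer IDs
back = tokenizer.convert_ids_to_tokens(ids)                # integer IDs -> tokens
```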
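
A sketch of the positional-encoding step, assuming the sinusoidal formulation from 'Attention Is All You Need': PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).

```python
import numpy as np
import tensorflow as tf

def positional_encoding(max_position, d_model):
    """Sinusoidal positional encodings, shape (1, max_position, d_model)."""
    positions = np.arange(max_position)[:, np.newaxis]  # (pos, 1)
    dims = np.arange(d_model)[np.newaxis, :]            # (1, d_model)
    angle_rads = positions / np.power(10000.0, (2 * (dims // 2)) / np.float32(d_model))
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])   # even dimensions: sine
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])   # odd dimensions: cosine
    return tf.cast(angle_rads[np.newaxis, ...], dtype=tf.float32)

# Added to the token embeddings before the first layer, e.g.:
# x = embeddings + positional_encoding(max_len, d_model)[:, :tf.shape(embeddings)[1], :]
```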
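
A sketch of the two masks, assuming pad tokens have ID 0 and the convention that a value of 1 marks a position to be masked out:

```python
import tensorflow as tf

def create_padding_mask(seq):
    """1.0 where the input contains pad tokens (ID 0), broadcastable to attention logits."""
    mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return mask[:, tf.newaxis, tf.newaxis, :]  # (batch, 1, 1, seq_len)

def create_look_ahead_mask(size):
    """Upper-triangular mask that hides future positions from the decoder."""
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
```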
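
A sketch of scaled dot-product attention, the core of each multi-head attention layer; in the decoder's cross-attention, K and V come from the encoder output and Q from the masked self-attention sublayer, as noted above:

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with optional masking."""
    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_logits = matmul_qk / tf.math.sqrt(dk)
    if mask is not None:
        scaled_logits += (mask * -1e9)             # masked positions vanish after softmax
    attention_weights = tf.nn.softmax(scaled_logits, axis=-1)
    return tf.matmul(attention_weights, v), attention_weights
```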
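
A sketch of the learning rate schedule from 'Attention Is All You Need', lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5)); the d_model, warmup_steps, and Adam hyperparameter values below are assumptions, not taken from the repo:

```python
import tensorflow as tf

class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    def __init__(self, d_model, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

learning_rate = CustomSchedule(d_model=768)  # 768 assumed to match the BERT encoder width
optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)
```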
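
A sketch of a teacher-forced training step with checkpointing; the transformer call signature, loss_function, and checkpoint directory are assumptions about this repo, not verified:

```python
import tensorflow as tf

def make_train_step(transformer, optimizer, loss_function):
    """Builds a training step that applies teacher forcing."""
    @tf.function
    def train_step(inp, tar):
        tar_inp = tar[:, :-1]   # decoder input: target shifted right (teacher forcing)
        tar_real = tar[:, 1:]   # labels: target shifted left
        with tf.GradientTape() as tape:
            predictions = transformer(inp, tar_inp, training=True)  # call signature assumed
            loss = loss_function(tar_real, predictions)
        gradients = tape.gradient(loss, transformer.trainable_variables)
        optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))
        return loss
    return train_step

# Checkpointing (directory name is a placeholder):
# ckpt = tf.train.Checkpoint(transformer=transformer, optimizer=optimizer)
# manager = tf.train.CheckpointManager(ckpt, "./checkpoints", max_to_keep=5)
# if manager.latest_checkpoint:
#     ckpt.restore(manager.latest_checkpoint)
```
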
References: