Transformers

Pytorch Implementation of Transformers Explained with Comments


Neural Network Hacks Tried

Xavier Initialization: All layers of the transformer are initialized with Xavier uniform. Xavier Uniform
Gradient Clipping: Gradients are clipped to avoid the exploding-gradient problem. Gradient Clipping
SGD optimizer with scheduler: Taken from the official PyTorch implementation of transformers; a combined sketch of these three hacks follows this list. SGD optimizer and scheduler
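
A minimal combined sketch of these three hacks in PyTorch (the stand-in model, learning rate, schedule, and clipping threshold are illustrative assumptions, not the repository's exact settings):

```python
import torch
import torch.nn as nn

def init_xavier(model: nn.Module):
    # Xavier-uniform initialization for every parameter with more than one dimension
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)

model = nn.Transformer(d_model=512, nhead=8)   # stand-in model (assumption)
init_xavier(model)

optimizer = torch.optim.SGD(model.parameters(), lr=0.5)                           # assumed learning rate
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)   # assumed schedule

def training_step(loss):
    optimizer.zero_grad()
    loss.backward()
    # gradient clipping: rescale gradients whose global norm exceeds the threshold
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)                    # assumed threshold
    optimizer.step()
```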

Hacks to Try

Adam Optimizer with scheduler: As mentioned in the transformer paper (see the sketch after this list). Transformers
Beam Search with length normalization: Beam search helps avoid neural text degeneration. Beam Search
Avoid Neural Degeneration with Nucleus Sampling: Nucleus sampling works better than beam search. Nucleus Sampling
Optimal No. of Heads: Based on the paper "Are Sixteen Heads Really Better than One?" Paper
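
A possible sketch of the first hack, Adam with the warm-up learning-rate schedule from the paper. The betas=(0.9, 0.98), eps=1e-9, and warmup_steps=4000 values follow the paper; the base lr of 1.0 combined with LambdaLR and the stand-in model are assumptions:

```python
import torch

# Learning-rate schedule from "Attention Is All You Need":
#   lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
def noam_lr(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)   # avoid 0^-0.5 on the very first call
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

model = torch.nn.Transformer(d_model=512, nhead=8)   # stand-in model (assumption)
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

# call scheduler.step() after every optimizer.step() so the factor above scales the base lr of 1.0
```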


Introduction

The Transformer is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Such models are superior in quality while being more parallelizable and requiring significantly less time to train. In this document we describe the transformer model completely, then build our own transformer in PyTorch and test it on the Cornell Movie Dialogs Corpus to show some interesting results.


Features of Transformers

Not Sequential



The whole input is fed into the transformer at once, whereas sequential models such as RNNs process one token at a time.

Self Attention

As the model processes each word (each position in the input sequence), self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.


There is a high correlation between 'man' and 'battle' and between 'man' and 'struggle', which is captured by self attention.

Multi Head Attention

This gives the model the advantage of focusing on different words in h different ways (h is the number of heads). It broadens the model's ability to focus on different positions and gives the attention layer multiple different representations.

In one head 'heroes' is attending to 'powers' and 'graced'
In another head 'heroes' is attending to 'path' and 'choose'

Architecture

The full model architecture of the transformer. (Image source: Fig 1 & 2 in Vaswani, et al., 2017.)

Input Embeddings

First, we encode every word as an embedding vector (e.g., GloVe embeddings). Since the transformer accepts whole sentences, we define a Max Length, which is the number of word embeddings passed in per sentence. Finally, we process the input in batches, so the final tensor that is processed has shape Embedding Dimension × Max Length × Batch Size.

The input to the transformer is of size embedding dimension × Max Length, and we feed in batches of these.
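
A minimal sketch of this input step (the vocabulary size, Max Length, and batch size here are illustrative assumptions):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512     # assumed vocabulary size and embedding dimension
max_len, batch_size = 50, 32         # assumed Max Length and batch size

embedding = nn.Embedding(vocab_size, d_model)

# a batch of token-id sequences, already padded/truncated to Max Length
tokens = torch.randint(0, vocab_size, (max_len, batch_size))

x = embedding(tokens)
print(x.shape)   # torch.Size([50, 32, 512]) -> Max Length x Batch Size x Embedding Dimension
```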

Positional Encoding

This technique is used because there is no notion of word order (1st word, 2nd word, ...) in the proposed architecture. All words of the input sequence are fed to the network with no special order or position (unlike common RNN or ConvNet architectures), so the model has no idea how the words are ordered. Consequently, a position-dependent signal is added to each word embedding to help the model incorporate the order of words.

A real example of positional encoding with a toy embedding size of 4 (from The Illustrated Transformer by Jay Alammar)
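
A common way to compute the sinusoidal encoding from the paper (a sketch; max_len and d_model stand for the Max Length and embedding dimension defined above):

```python
import math
import torch

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe   # shape: (max_len, d_model)

# x has shape (Max Length, Batch Size, d_model); the same encoding is added to every batch element
# x = x + positional_encoding(max_len, d_model).unsqueeze(1)
```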

Multi-Head Attention

The General Framework of Attention is given by

Attention(Q, K, V) = Softmax(Q K^T / √d_h) V

where Q is the query vector, K is the key vector, and V is the value vector.

Here d_h = embedding size / h, where h is the number of attention heads.

In the case of multi-head attention, for each head i: head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

Finally, all the attention heads are concatenated and passed through a linear layer of the same size as the input so that the dimensions do not change: MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O. We compute h different attention heads, but concatenation alone is not enough to mix information between heads, so the concatenated heads are passed through the linear layer W^O.
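
A sketch of how this could be implemented (batch-first tensors and the combined per-head projections are implementation conveniences assumed here, not necessarily how the repository's model.py does it):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_h)) V
    d_h = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_h)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    return F.softmax(scores, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, h):
        super().__init__()
        assert d_model % h == 0
        self.h = h
        self.d_h = d_model // h
        self.w_q = nn.Linear(d_model, d_model)   # W^Q for all heads at once
        self.w_k = nn.Linear(d_model, d_model)   # W^K
        self.w_v = nn.Linear(d_model, d_model)   # W^V
        self.w_o = nn.Linear(d_model, d_model)   # final linear layer W^O

    def forward(self, q, k, v, mask=None):
        batch = q.size(0)
        # project, then split into h heads: (batch, h, seq_len, d_h)
        q = self.w_q(q).view(batch, -1, self.h, self.d_h).transpose(1, 2)
        k = self.w_k(k).view(batch, -1, self.h, self.d_h).transpose(1, 2)
        v = self.w_v(v).view(batch, -1, self.h, self.d_h).transpose(1, 2)
        heads = attention(q, k, v, mask)
        # concatenate the heads back to (batch, seq_len, d_model) and mix them with W^O
        concat = heads.transpose(1, 2).contiguous().view(batch, -1, self.h * self.d_h)
        return self.w_o(concat)
```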

Residual Learning

We are learning only what is left over (the residual), instead of learning a completely new representation. If the block does not learn anything, then F(X) is 0 and the output is simply the input X, which is what makes training go much faster, since learning a completely new representation is avoided. Therefore, the model can default to using the identity function if the layer is not beneficial.

Either learn something useful, or don’t learn anything!

Layer Normalization

We have performed a lot of operations that may cause the values of the layer outputs to become large, so we use Layer Norm to normalize them back again and keep the outputs from growing unchecked.
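
A sketch that combines the residual connection from the previous section with layer normalization (the "Add & Norm" step in the architecture diagram); the dropout and its rate are a common addition assumed here, not necessarily the repository's setting:

```python
import torch.nn as nn

class AddAndNorm(nn.Module):
    # residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # if the sublayer learns nothing useful (output close to 0), the block is roughly the identity
        return self.norm(x + self.dropout(sublayer(x)))

# usage: out = add_and_norm(x, lambda t: self_attention(t, t, t))
```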

Masked Multi-Head Attention

For self-attention in the decoder, we do not want the decoder to attend to future words; otherwise, the model would cheat by looking ahead during training. At test time we do not have future words: we predict one word at a time, running the decoder for a number of timesteps, just like an LSTM at inference. Attending to future positions during training would therefore be incompatible with testing. So the decoder is only allowed to attend to earlier positions, and during testing it can only attend to what has been generated so far; we need to reproduce this test-time scenario during training as well.
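
One way to build such a mask (a sketch; positions marked 0 are blocked by filling their attention scores with -inf before the softmax, as in the attention sketch above):

```python
import torch

def subsequent_mask(size):
    # lower-triangular matrix: position i may only attend to positions <= i
    return torch.tril(torch.ones(size, size, dtype=torch.bool))

print(subsequent_mask(4).int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]], dtype=torch.int32)
```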

Install

This project requires Python and the following Python libraries to be installed:

If you do not have Python installed yet, it is highly recommended that you install the Anaconda distribution of Python, which already has the above packages and more included.

Code

dataset.py: Reads and tokenizes the dataset (Cornell Movie Dialogs Corpus).
model.py: Generic PyTorch implementation of the transformer.
train.py: Training loop.
config.py: Configuration of the model.
chat.py: Loads the model and allows interactive chatting in the terminal.