/100-Days-of-NLP

Primary LanguageJupyter NotebookMIT LicenseMIT

emb

There is nothing magic about magic. The magician merely understands something simple which doesn’t appear to be simple or natural to the untrained audience. Once you learn how to hold a card while making your hand look empty, you only need practice before you, too, can “do magic.” – Jeffrey Friedl in the book Mastering Regular Expressions

Note: Please raise an issue for any suggestions, corrections, and feedback.

Most of the code samples are done using Jupyter notebooks (using colab). So each code can be run independently.

The following topics have been explored:

Note: The difficulty level has been assigned according to my understanding.

NLP Concepts

Tokenization Word Embeddings - Word2Vec Word Embeddings - GloVe Word Embeddings - ELMo
RNN, LSTM, GRU Packing Padded Sequences Attention Mechanism - Luong Attention Mechanism - Bahdanau
Pointer Network Transformer GPT-2 BERT
Topic Modeling - LDA Principal Component Analysis (PCA) Naive Bayes Data Augmentation
Sentence Embeddings

Tokenization



The process of converting textual data to tokens, is one of the most important step in NLP. Tokenization using the following methods has been explored:

Word Embeddings - Word2Vec



A word embedding is a learned representation for text where words that have the same meaning have a similar representation. It is this approach to representing words and documents that may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems.

emb

Word2Vec is one of the most popular pretrained word embeddings developed by Google. Depending on the way the embeddings are learned, Word2Vec is classified into two approaches:

  • Continuous Bag-of-Words (CBOW)
  • Skip-gram model

word2vec arch

Word Embeddings - GloVe



GloVe is another commonly used method of obtaining pre-trained embeddings. GloVe aims to achieve two goals:

  • Create word vectors that capture meaning in vector space
  • Takes advantage of global count statistics instead of only local information

Word Embeddings - ELMo



ELMo is a deep contextualized word representation that models:

  • complex characteristics of word use (e.g., syntax and semantics)
  • how these uses vary across linguistic contexts (i.e., to model polysemy).

These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus.

elmo arch

RNN, LSTM, GRU



Recurrent networks - RNN, LSTM, GRU have proven to be one of the most important unit in NLP applications because of their architecture. There are many problems where the sequence nature needs to be remembered like in order to predict an emotion in the scene, previous scenes needs to be remembered.

rnn gif

Packing Padded Sequences



When training RNN (LSTM or GRU or vanilla-RNN), it is difficult to batch the variable-length sequences. Ideally we will pad all the sequences to a fixed length and end up doing un-necessary computations. How can we overcome this? PyTorch provides the pack_padded_sequences functionality.

pack img

Attention Mechanism - Luong



The attention mechanism was born to help memorize long source sentences in neural machine translation (NMT). Rather than building a single context vector out of the encoder's last hidden state, attention is used to focus more on the relevant parts of the input while decoding a sentence. The context vector will be created by taking encoder outputs and the current output of the decoder rnn.

pack img

The attention score can be calculated in three ways. dot, general and concat.

luong_fn

Attention Mechanism - Bahdanau



The major difference between Bahdanau & Luong attention is the way the context vector is created. The context vector will be created by taking encoder outputs and the previous hidden state of the decoder rnn. Where is in Luong attention the context vector will be created by taking encoder outputs and the current hidden state of the decoder rnn.

Once the context is calculated it is combined with decoder input embedding and fed as input to decoder rnn.

pack img

The Bahdanau attention is also called as additive attention.

bahdanau_fn

Transformer



Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences. Such attention mechanisms are used in conjunction with a recurrent network.

The Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output.

transformer

GPT-2



The GPT-2 paper states that:

Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on taskspecific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

gpt2

The GPT-2 utilizes a 12-layer Decoder Only Transformer architecture.

gpt2

BERT



Language modeling is an effective task for using unlabeled data to pretrain neural networks in NLP. Traditional language models take the previous n tokens and predict the next one. In contrast, BERT trains a language model that takes both the previous and next tokens into account when predicting. BERT is also trained on a next sentence prediction task to better handle tasks that require reasoning about the relationship between two sentences (e.g. similar questions or not)

bert

BERT uses the Transformer architecture for encoding sentences.

bert

Pointer Network



Pointer networks are sequence-to-sequence models where the output is discrete tokens corresponding to positions in an input sequence. The main differences between pointer networks and standard seq2seq models are:
  • The output of pointer networks is discrete and correspond to positions in the input sequence

  • The number of target classes in each step of the output depends on the length of the input, which is variable.

It differs from the previous attention attempts in that, instead of using attention to blend hidden units of an encoder to a context vector at each decoder step, it uses attention as a pointer to select a member of the input sequence as the output.

pointer

Topic Modelling using LDA



One of the primary applications of natural language processing is to automatically extract what topics people are discussing from large volumes of text. Some examples of large text could be feeds from social media, customer reviews of hotels, movies, etc, user feedbacks, news stories, e-mails of customer complaints etc.

Knowing what people are talking about and understanding their problems and opinions is highly valuable to businesses, administrators, political campaigns. And it’s really hard to manually read through such large volumes and compile the topics.

Thus is required an automated algorithm that can read through the text documents and automatically output the topics discussed.

In this notebook, we will take a real example of the 20 Newsgroups dataset and use LDA to extract the naturally discussed topics.

lda

LDA’s approach to topic modeling is it considers each document as a collection of topics in a certain proportion. And each topic as a collection of keywords, again, in a certain proportion.

Once you provide the algorithm with the number of topics, all it does it to rearrange the topics distribution within the documents and keywords distribution within the topics to obtain a good composition of topic-keywords distribution.

Principal Component Analysis(PCA)



PCA is fundamentally a dimensionality reduction technique that transforms the columns of a dataset into a new set features. It does this by finding a new set of directions (like X and Y axes) that explain the maximum variability in the data. This new system coordinate axes is called Principal Components (PCs).

pca

Practically PCA is used for two reasons:

  • Dimensionality Reduction: The information distributed across a large number of columns is transformed into principal components (PC) such that the first few PCs can explain a sizeable chunk of the total information (variance). These PCs can be used as explanatory variables in Machine Learning models.

  • Visualize Data: Visualising the separation of classes (or clusters) is hard for data with more than 3 dimensions (features). With the first two PCs itself, it’s usually possible to see a clear separation.

Naive Bayes Algorithm



A Naive Bayes classifier is a probabilistic machine learning model that’s used for classification task. The crux of the classifier is based on the Bayes theorem.

naive

Using Bayes theorem, we can find the probability of A happening, given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that the predictors/features are independent. That is presence of one particular feature does not affect the other. Hence it is called naive.

Types of Naive Bayes Classifier:

Multinomial Naive Bayes: This is mostly used when the variables are discrete (like words). The features/predictors used by the classifier are the frequency of the words present in the document.

Gaussian Naive Bayes: When the predictors take up a continuous value and are not discrete, we assume that these values are sampled from a gaussian distribution.

Bernoulli Naive Bayes: This is similar to the multinomial naive bayes but the predictors are boolean variables. The parameters that we use to predict the class variable take up only values yes or no, for example if a word occurs in the text or not.

Using 20newsgroup dataset, naive bayes algorithm is explored to do the classification.

Data Augmentation in NLP



In Computer Vision using image data augmentation is a standard practice. This is because trivial operations for images like rotating an image a few degrees or converting it into grayscale doesn’t change its semantics. Whereas in natural language processing (NLP) field, it is hard to augmenting text due to high complexity of language.

Data Augmentation using the following techniques is explored:

  • Synonym-based Substitution
  • Antonym-based Substitution
  • Back Translation
  • Text Surface Transformation
  • Random Noise Injection
  • Word Embedding based Substitution
  • Contextual Word Embeddings (BERT family) based Substitution

aug

Sentence Embeddings



A new architecture called SBERT was explored. The siamese network architecture enables that fixed-sized vectors for input sentences can be derived. Using a similarity measure like cosinesimilarity or Manhatten / Euclidean distance, semantically similar sentences can be found.

sentence emb

Classification based Applications

Sentiment Analysis - IMDB Sentiment Classification - Hinglish Document Classification
Duplicate Question Pair Classification - Quora POS Tagging Natural Language Inference - SNLI
Toxic Comment Classification Grammatically Correct Sentence - CoLA NER Tagging

Sentiment Analysis - IMDB

Sentiment analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.

Following varients have been explored:

Simple Sentiment Classification with RNN



RNN is used for processing and identifying the sentiment.

sentiment

Improved Sentiment Classification



After trying the basic RNN which gives a test_accuracy less than 50%, following techniques have been experimented and a test_accuracy above 88% is achieved.

Techniques used:

  • packed padded sequences
  • pre-trained word embeddings
  • different RNN architecture
  • bidirectional RNN
  • multi-layer RNN
  • regularization
  • a different optimizer

sentiment

Sentiment Analysis with Attention



Attention helps in focusing on the relevant input when predicting the sentiment of the input. Bahdanau attention was used with taking the outputs of LSTM and concatenating the final forward & backward hidden state. Without using the pre-trained word embeddings, test accuracy of 88% is achieved.

sentiment

Sentiment Analysis with BERT



BERT obtains new state-of-the-art results on eleven natural language processing tasks. Transfer learning in NLP has triggered after the release of BERT model. Using BERT to do the sentiment analysis is explored.

sentiment

Sentiment Analysis - HINGLISH

Mixing languages, also known as code-mixing, is a norm in multilingual societies. Multilingual people, who are non-native English speakers, tend to code-mix using English-based phonetic typing and the insertion of anglicisms in their main language.

The task is to predict the sentiment of a given code-mixed tweet. The sentiment labels are positive, negative, or neutral, and the code-mixed languages will be English-Hindi. (Sentimix)

Following varients have been explored:

Sentimix using MLP



Using the simple MLP model, F1 score of 0.58 was achieved on test data

codemix

Sentimix using LSTM



After exploring the basic MLP model, the LSTM model was used for sentiment prediction and F1 score of 0.57 was achieved.

codemix

The results were actually less compared to a basic MLP model. One of the reason might be LSTM is not able to learn the relationships among the words in a sentence due to the highly diverse nature of the Code-mixed data.

Sentimix with XLM-RoBERTa



As the LSTM is not able to learn the relationships between the words in a code-mixed sentence due to the highly diverse nature of the Code-mixed data and no pre-trained embeddings are used, the F1 score is less.

To alleviate this problem XLM-RoBERTa model (which has been pre-trained on 100 languages) is being used to encode the sentence. In order to use the XLM-RoBERTa model, the sentence needs to be in a proper language. So first the Hinglish words need to be converted to the Hindi (Devanagari) form.

codemix

An F1 score of 0.59 was achieved. Methods to improve this will be explored later on.

Sentimix with XLM-RoBERTa-LSTM-Attention



The final output from XLM-RoBERTa model were used as input embeddings to bi-directional LSTM model. An attention layer, which takes the outputs from LSTM layer, produces a weighted representation of the input, which is then passed through a classifier for predicting the sentiment of the sentence.

codemix

An F1 score of 0.64 was achieved.

Sentimix with XLM-RoBERTA-CNN



In the same way that a 3x3 filter can look over a patch of an image, a 1x2 filter can look over a 2 sequential words in a piece of text, i.e. a bi-gram. In this CNN model we will instead use multiple filters of different sizes which will look at the bi-grams (a 1x2 filter), tri-grams (a 1x3 filter) and/or n-grams (a 1xn filter) within the text.

The intuition here is that the appearance of certain bi-grams, tri-grams and n-grams within the review will be a good indication of the final sentiment.

codemix

An F1 score of 0.69 was achieved.

Sentimix with XLM-RoBERTA-Ensemble



CNN captures the local dependencies where as RNN captures the global dependencies. By combining both we can get better understanding of the data. Ensembling of CNN model and Bidirectional-GRU-Attention model out performs the other ones.

codemix

An F1 score of 0.71 was achieved. (Top 5 in the leaderboard).

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.

Following varients have been explored:

Document Classification with Hierarchical Attention Network



A Hierarchical Attention Network (HAN) considers the hierarchical structure of documents (document - sentences - words) and includes an attention mechanism that is able to find the most important words and sentences in a document while taking the context into consideration.

han

Improved HAN with regularization techniques



The basic HAN model is overfitting rapidly. In order to overcome this, techniques like Embedding Dropout, Locked Dropout are explored. There is one more other technique called Weight Dropout which is not implemented (Let me know if there are any good resources to implement this). Pre-trained word embeddings Glove are also used instead of random initialization. Since the attention can be done on sentence level and word level, we can visualize which words are important in a sentence and which sentences are important in a document.

han

han

han

Duplicate Question Pair Classification - Quora

QQP stands for Quora Question Pairs. The objective of the task is for a given pair of questions; we need to find whether those questions are semantically similar to each other or not.

Following varients have been explored:

QQP Classification with Siamese Network



The algorithm needs to take the pair of questions as input and should output their similarity. A Siamese network is used. A Siamese neural network (sometimes called a twin neural network) is an artificial neural network that uses the same weights while working in tandem on two different input vectors to compute comparable output vectors.

qqp

QQP Classification with BERT



After trying the siamese model, BERT was explored to do the Quora duplicate question pairs detection. BERT takes the question 1 and question 2 as input separated by [SEP] token and the classification was done using the final representation of [CLS] token.

qqp

POS Tagging

Part-of-Speech (PoS) tagging, is a task of labelling each word in a sentence with its appropriate part of speech.

Following varients have been explored:

POS Tagging with BiLSTM



This code covers the basic workflow. We'll learn how to: load data, create train/test/validation splits, build a vocabulary, create data iterators, define a model and implement the train/evaluate/test loop and run time (inference) tagging.

The model used is a Multi Layer Bi-directional LSTM network

pos

POS Tagging with Transformer



After trying the RNN approach, POS tagging with Transformer based architecture is explored. Since the Transformer contains both Encoder and Decoder and for the sequence labeling task only Encoder will be sufficient. As the data is small having 6 layers of Encoder will overfit the data. So a 3-layer Transformer Encoder model was used.

pos

POS Tagging with BERT



After trying POS tagging with Transformer Encoder, POS Tagging with pre-trained BERT model is exploed. It achieved test accuracy of 91%.

pos

Natural Language Inference - SNLI

The goal of natural language inference (NLI), a widely-studied natural language processing task, is to determine if one given statement (a premise) semantically entails another given statement (a hypothesis).

Following varients have been explored:

NLI with BiLSTM



A basic model with Siamese BiLSTM network is implemeted

nli

This can be treated as base-line setup. A test accuracy of 76.84% was achieved.

NLI with Attention



In the previous notebook, the final hidden states of Premise and Hypothesis as the representations from LSTM. Now instead of taking the final hidden states, attention will be computed across all the input tokens and a final weighted vector is taken as the representation of Premise and Hypothesis.

nli

The test accuracy increased from 76.84% to 79.51%.

NLI with Transformer



Transformer Encoder was used to encode the Premise and Hypothesis. Once the sentence is passed through the Encoder, summation of all the tokens is considered as the final representation (others variants can be explored). The model accuracy is less compared to RNN variants.

nli

NLI with BERT



NLI with Bert base model was explored. BERT takes the Premise and Hypothesis as inputs separated by [SEP] token and the classification was done using the final representation of [CLS] token.

nli

NLI with Distillation



Distillation: A technique you can use to compress a large model, called the teacher, into a smaller model, called the student. Following student, teacher models are used in order to perform distillation on NLI.

  • Student Model: Logistic Regression
  • Teacher Model: Bi-directional LSTM with Attention

nli

Toxic Comment Classification

Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments.

You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:

  • toxic
  • severe_toxic
  • obscene
  • threat
  • insult
  • identity_hate

Following varients have been explored:

Toxic Comment Classification with GRU



The model used is a Bi-directional GRU network.

toxic

A test accuracy of 99.42% was achieved. Since 90% of the data is not labeled into any of the toxicity, simply predicting all the data as non-toxic gives a 90% accurate model. So accuracy is not a reliable metric. A different metric ROC AUC was implemented.

Improved Toxic Comment Classification



With Categorical Cross Entropy as the loss, roc_auc score of 0.5 is achieved. By changing the loss to Binary Cross Entropy and also modifying the model a bit by adding pooling layers (max, mean), the roc_auc score improved to 0.9873.

toxic

Toxic Comment Classification using Streamlit



Converted the Toxic Comment Classification into an app using streamlit. The pre-trained model is available now.

toxic

Grammatically Correct Sentence - CoLA

Can artificial neural networks have the ability to judge the grammatical acceptability of a sentence? In order to explore this task, the Corpus of Linguistic Acceptability (CoLA) dataset is used. CoLA is a set of sentences labeled as grammatically correct or incorrect.

Following varients have been explored:

CoLA with BERT



BERT obtains new state-of-the-art results on eleven natural language processing tasks. Transfer learning in NLP has triggered after the release of BERT model. In this notebook, we will explore how to use BERT for classifying whether a sentence is grammatically correct or not using CoLA dataset.

cola

An accuracy of 85% and Matthews Correlation Coefficient (MCC) of 64.1 were achieved.

CoLA with DistilBERT



Distillation: A technique you can use to compress a large model, called the teacher, into a smaller model, called the student. Following student, teacher models are used in order to perform distillation on CoLA.

  • Student Model: Distilbert base uncased
  • Teacher Model: Bert base uncased

cola

Following experiments have been tried:

  • Training using Bert Model (Teacher). Acc: 84.06, MCC: 61.5
  • Training using Distilbert Model (without teacher forcing). Acc: 82.54, MCC: 57
  • Training using Distilbert Model (with teacher forcing). Acc: 82.92, MCC: 57.9

NER Tagging

Named-Entity-Recognition (NER) tagging, is a task of labelling each word in a sentence with its appropriate entity.

Following varients have been explored:

NER Tagging with BiLSTM



This code covers the basic workflow. We'll see how to: load data, create train/test/validation splits, build a vocabulary, create data iterators, define a model and implement the train/evaluate/test loop and train, test the model.

The model used is Bi-directional LSTM network

ner

NER Tagging with BiLSTM-CRF



In the case of Sequence tagging (NER) the tag of a current word might depend on previous word's tag. (ex: New York).

Without a CRF, we would have simply used a single linear layer to transform the output of the Bidirectional LSTM into scores for each tag. These are known as emission scores, which are a representation of the likelihood of the word being a certain tag.

A CRF calculates not only the emission scores but also the transition scores, which are the likelihood of a word being a certain tag considering the previous word was a certain tag. Therefore the transition scores measure how likely it is to transition from one tag to another.

ner

NER Decoding using Viterbi Algorithm



For decoding, Viterbi algorithm is used.

Since we're using CRFs, we're not so much predicting the right label at each word as we are predicting the right label sequence for a word sequence. Viterbi Decoding is a way to do exactly this – find the most optimal tag sequence from the scores computed by a Conditional Random Field.

ner

NER tagging with Char-BiLSTM-CRF



Using sub-word information in our tagging task because it can be a powerful indicator of the tags, whether they're parts of speech or entities. For example, it may learn that adjectives commonly end with "-y" or "-ul", or that places often end with "-land" or "-burg".

Therefore, our sequence tagging model uses both

  • word-level information in the form of word embeddings.
  • character-level information up to and including each word in both directions.

ner

Evaluation metrics for NER tagging



Micro and macro-averages (for whatever metric) will compute slightly different things, and thus their interpretation differs. A macro-average will compute the metric independently for each class and then take the average (hence treating all classes equally), whereas a micro-average will aggregate the contributions of all classes to compute the average metric. In a multi-class classification setup, micro-average is preferable if you suspect there might be class imbalance (i.e you may have many more examples of one class than of other classes).

ner

ner

NER using Streamlit



Converted the NER tagging into an app using streamlit. The pre-trained model(Char-BiLSTM-CRF) is available now.

ner

NER tagging with Transformer



After trying the RNN approach, NER tagging with Transformer based architecture is explored. Since the Transformer contains both Encoder and Decoder and for the sequence labeling task only Encoder will be sufficient. A 3-layer Transformer Encoder model was used.

ner

NER tagging with BERT



After trying NER tagging with Transformer Encoder, NER Tagging with pre-trained bert-base-cased model is explored.

ner

NER tagging with Transformer-CRF



The transformer alone is not giving good results compared to BiLSTM in the NER tagging task. Augmenting the CRF layer on top of the transformer is implemented which is improving results compared to standalone Transformer.

ner

NER tagging with Spacy



SpaCy provides an exceptionally efficient statistical system for NER in python, which can assign labels to groups of tokens. It provides a default model which can recognize a wide range of named or numerical entities, which include person, organization, language, event etc.

ner

Apart from these default entities, spaCy also gives us the liberty to add arbitrary classes to the NER model, by training the model to update it with newer trained examples.

2 new entities called ACTIVITY and SERVICE in a specific domain data (bank) are created and trained with few training samples.

ner

Generation based Applications

Name Generation Machine Translation Utterance Generation
Image Captioning Image Captioning - LaTeX Equations News Summarization
Email Subject Generation

Name Generation with LSTM



A character-level LSTM language model is used. That is, we’ll give the LSTM a huge chunk of names and ask it to model the probability distribution of the next character in the sequence given a sequence of previous characters. This will then allow us to generate new name one character at a time

name_gen

Machine Translation

Machine Translation (MT) is the task of automatically converting one natural language into another, preserving the meaning of the input text, and producing fluent text in the output language. Ideally, a source language sequence is translated into target language sequence. The task is to convert sentences from German to English.

Following varients have been explored:

Basic Machine Translation



The most common sequence-to-sequence (seq2seq) models are encoder-decoder models, which commonly use a recurrent neural network (RNN) to encode the source (input) sentence into a single vector. In this notebook, we'll refer to this single vector as a context vector. We can think of the context vector as being an abstract representation of the entire input sentence. This vector is then decoded by a second RNN which learns to output the target (output) sentence by generating it one word at a time.

basic_mt

Improved Machine Translation



After trying the basic machine translation which has text perplexity 36.68, following techniques have been experimented and a test perplexity 7.041.

  • GRU is used instead of LSTM
  • Single layer
  • Context vector is sent to decoder rnn along with decoder input embedding
  • Context vector is sent to classifier along with the decoder hidden state

improved_mt

Checkout the code in applications/generation folder

Machine Translation with Bahdanau Attention



The attention mechanism was born to help memorize long source sentences in neural machine translation (NMT). Rather than building a single context vector out of the encoder's last hidden state, attention is used to focus more on the relevant parts of the input while decoding a sentence. The context vector will be created by taking encoder outputs and the previous hidden state of the decoder rnn.

bahdanau_mt

Masking, Packing padded inputs, Attention Visualization, BLEU on MT



Enhancements like masking (ignoring the attention over padded input), packing padded sequences (for better computation), attention visualization and BLEU metric on test data are implemented.

mt_visual

Machine Translation with Transformer



The Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output is used to do Machine translation from German to English

mt_visual

Self Attention Visualization



Run time translation (Inference) and attention visualization are added for the transformer based machine translation model.

mt_visual

Utterance Generation

Utterance generation is an important problem in NLP, especially in question answering, information retrieval, information extraction, conversation systems, to name a few. It could also be used to create synthentic training data for many NLP problems.

Following varients have been explored:

Basic Utterance Generation



The most common used model for this kind of application is sequence-to-sequence network. A basic 2 layer LSTM was used.

utt_gen

Utterance Generation with Attention



The attention mechanism will help in memorizing long sentences. Rather than building a single context vector out of the encoder's last hidden state, attention is used to focus more on the relevant parts of the input while decoding a sentence. The context vector will be created by taking encoder outputs and the hidden state of the decoder rnn.

After trying the basic LSTM apporach, Utterance generation with attention mechanism was implemented. Inference (run time generation) was also implemented.

utt_gen

While generating the a word in the utterance, decoder will attend over encoder inputs to find the most relevant word. This process can be visualized.

utt_gen

Utterance Generation with Beam Search



One of the ways to mitigate the repetition in the generation of utterances is to use Beam Search. By choosing the top-scored word at each step (greedy) may lead to a sub-optimal solution but by choosing a lower scored word that may reach an optimal solution.

Instead of greedily choosing the most likely next step as the sequence is constructed, the beam search expands all possible next steps and keeps the k most likely, where k is a user-specified parameter and controls the number of beams or parallel searches through the sequence of probabilities.

utt_gen

Utterance Generation with Coverage



Repetition is a common problem for sequenceto-sequence models, and is especially pronounced when generating a multi-sentence text. In coverage model, we maintain a coverage vector c^t, which is the sum of attention distributions over all previous decoder timesteps

utt_gen

This ensures that the attention mechanism’s current decision (choosing where to attend next) is informed by a reminder of its previous decisions (summarized in c^t). This should make it easier for the attention mechanism to avoid repeatedly attending to the same locations, and thus avoid generating repetitive text.

Utterance Generation with Transformer



The Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output is used to do generate utterance from a given sentence. The training time was also lot faster 4x times compared to RNN based architecture.

utt_gen

Beam Search in Utterance Generation with Transformer



Added beam search to utterance generation with transformers. With beam search, the generated utterances are more diverse and can be more than 1 (which is the case of the greedy approach). This implemented was better than naive one implemented previously.

utt_gen

Utterance Generation with BPE Tokenization



Utterance generation using BPE tokenization instead of Spacy is implemented.

Today, subword tokenization schemes inspired by BPE have become the norm in most advanced models including the very popular family of contextual language models like BERT, GPT-2,RoBERTa, etc.

BPE brings the perfect balance between character and word-level hybrid representations which makes it capable of managing large corpora. This behavior also enables the encoding of any rare words in the vocabulary with appropriate subword tokens without introducing any “unknown” tokens.

utt_gen

Utterance Generation using Streamlit



Converted the Utterance Generation into an app using streamlit. The pre-trained model trained on the Quora dataset is available now.

utt_gen

General Utterance Generation



Till now the Utterance Generation is trained using the Quora Question Pairs dataset, which contains sentences in the form of questions. When given a normal sentence (which is not in a question format) the generated utterances are very poor. This is due the bias induced by the dataset. Since the model is only trained on question type sentences, it fails to generate utterances in case of normal sentences. In order to generate utterances for a normal sentence, COCO dataset is used to train the model. utt_gen

utt_gen

Image Captioning

Image Captioning is the process of generating a textual description of an image. It uses both Natural Language Processing and Computer Vision techniques to generate the captions.

Flickr8K dataset is used. It contains 8092 images, each image having 5 captions.

Following varients have been explored:

Basic Image Captioning



The encoder-decoder framework is widely used for this task. The image encoder is a convolutional neural network (CNN). The decoder is a recurrent neural network(RNN) which takes in the encoded image and generates the caption.

In this notebook, the resnet-152 model pretrained on the ILSVRC-2012-CLS image classification dataset is used as the encoder. The decoder is a long short-term memory (LSTM) network.

img_cap

Image Captioning with Attention



In this notebook, the resnet-101 model pretrained on the ILSVRC-2012-CLS image classification dataset is used as the encoder. The decoder is a long short-term memory (LSTM) network. Attention is implemented. Instead of the simple average, we use the weighted average across all pixels, with the weights of the important pixels being greater. This weighted representation of the image can be concatenated with the previously generated word at each step to generate the next word of the caption.

img_cap

Image Captioning with Beam Search



Instead of greedily choosing the most likely next step as the caption is constructed, the beam search expands all possible next steps and keeps the k most likely, where k is a user-specified parameter and controls the number of beams or parallel searches through the sequence of probabilities.

img_cap

Image Captioning with BPE Tokenization



Today, subword tokenization schemes inspired by BPE have become the norm in most advanced models including the very popular family of contextual language models like BERT, GPT-2,RoBERTa, etc.

BPE brings the perfect balance between character and word-level hybrid representations which makes it capable of managing large corpora. This behavior also enables the encoding of any rare words in the vocabulary with appropriate subword tokens without introducing any “unknown” tokens.

BPE was used in order to tokenize the captions instead of using nltk.

img_cap

Image Captioning - LaTex Equations

An application of image captioning is to convert the the equation present in the image to latex format.

Following varients have been explored:

Basic Image to Latex



An application of image captioning is to convert the the equation present in the image to latex format. Basic Sequence-to-Sequence models is used. CNN is used as encoder and RNN as decoder. Im2latex dataset is used. It contains 100K samples comprising of training, validation and test splits.

img_cap

Generated formulas are not great. Following notebooks will explore techniques to improve it.

Image to Latex with Attention



Latex code generation using the attention mechanism is implemented. Instead of the simple average, we use the weighted average across all pixels, with the weights of the important pixels being greater. This weighted representation of the image can be concatenated with the previously generated word at each step to generate the next word of the formula.

img_cap

Image to Latex with Beam Search



Added beam search in the decoding process. Also added Positional encoding to the input image and learning rate scheduler.

Image to LaTex Conversion using Streamlit



Converted the Latex formula generation into an app using streamlit.

latex

News Summarization with T5



Automatic text summarization is the task of producing a concise and fluent summary while preserving key information content and overall meaning. Have you come across the mobile app inshorts? It’s an innovative news app that converts news articles into a 60-word summary. And that is exactly what we are going to do in this notebook. The model used for this task is T5.

news_sum

Email Subject Generation with T5.



Given the overwhelming number of emails, an effective subject line becomes essential to better inform the recipient of the email's content.

Email subject generation using T5 model was explored. AESLC dataset was used for this purpose.

email_sub

Question-Answering based Applications

Clustering based Applications

Topic Identification in News Covid Article finding

Topic Identification in News

Topic Identification is a Natural Language Processing (NLP) is the task to automatically extract meaning from texts by identifying recurrent themes or topics.

Following varients have been explored:

Topic Identification in News using LDA



LDA’s approach to topic modeling is it considers each document as a collection of topics in a certain proportion. And each topic as a collection of keywords, again, in a certain proportion.

Once you provide the algorithm with the number of topics, all it does it to rearrange the topics distribution within the documents and keywords distribution within the topics to obtain a good composition of topic-keywords distribution.

20 Newsgroup dataset was used and only the articles are provided to identify the topics. Topic Modelling algorithms will provide for each topic what are the important words. It is upto us to infer the topic name.

lda

Improved Topic Identification in News using LDA



Choosing the number of topics is a difficult job in Topic Modelling. In order to choose the optimal number of topics, grid search is performed on various hypermeters. In order to choose the best model the model having the best perplexity score is choosed.

A good topic model will have non-overlapping, fairly big sized blobs for each topic.

lda

  • LDA using scikit-learn is implemented.
  • Inference (predicting the topic of a given sentence) is also implemented.

Topic Identification in News using LSA



We would clearly expect that the words that appear most frequently in one topic would appear less frequently in the other - otherwise that word wouldn't make a good choice to separate out the two topics. Therefore, we expect the topics to be orthogonal.

Latent Semantic Analysis (LSA) uses SVD. You will sometimes hear topic modelling referred to as LSA.

The SVD algorithm factorizes a matrix into one matrix with orthogonal columns and one with orthogonal rows (along with a diagonal matrix, which contains the relative importance of each factor).

svd

Notes:

  • SVD is a determined dimension reduction algorithm
  • LDA is a probability-based generative model

Covid article finding using LDA



Finding the relevant article from a covid-19 research article corpus of 50K+ documents using LDA is explored.

The documents are first clustered into different topics using LDA. For a given query, dominant topic will be found using the trained LDA. Once the topic is found, most relevant articles will be fetched using the jensenshannon distance.

Only abstracts are used for the LDA model training. LDA model was trained using 35 topics.

lda

Question Answering based Applications

Factual Question Answering Visual Question Answering Boolean Question Answering
Closed Question Answering

Factual Question Answering

Given a set of facts, question concering them needs to be answered. Dataset used is bAbI which has 20 tasks with an amalgamation of inputs, queries and answers. See the following figure for sample.

babi

Following varients have been explored:

Basic Question Answering with Dynamic Memory Networks



Dynamic Memory Network (DMN) is a neural network architecture which processes input sequences and questions, forms episodic memories, and generates relevant answers.

dmn

Question Answering using DMN Plus



The main difference between DMN+ and DMN is the improved InputModule for calculating the facts from input sentences keeping in mind the exchange of information between input sentences using a Bidirectional GRU and a improved version of MemoryModule using Attention based GRU model.

dmn

Visual Question Answering



Visual Question Answering (VQA) is the task of given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

Following varients have been explored:

Basic Visual Question Answering



vqa

The model uses a two layer LSTM to encode the questions and the last hidden layer of VGGNet to encode the images. The image features are then l_2 normalized. Both the question and image features are transformed to a common space and fused via element-wise multiplication, which is then passed through a fully connected layer followed by a softmax layer to obtain a distribution over answers.

Visual Question Answering with DMN Plus



To apply the DMN to visual question answering, input module is modified for images. The module splits an image into small local regions and considers each region equivalent to a sentence in the input module for text.

The input module for VQA is composed of three parts, illustrated in below fig:

  • local region feature extraction
  • visual feature embedding
  • input fusion layer

vqa

Boolean Question Answering



Boolean question answering is to answer whether the question has answer present in the given context or not. The BoolQ dataset contains the queries for complex, non-factoid information, and require difficult entailment-like inference to solve.

boolqa

Closed Question Answering



Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage

Following varients have been explored:

Question Answering using Dynamic-CoAttention-Network



The DCN first fuses co-dependent representations of the question and the document in order to focus on relevant parts of both. Then a dynamic pointing decoder iterates over potential answer spans. This iterative procedure enables the model to recover from initial local maxima corresponding to incorrect answers.

The Dynamic Coattention Network has two major parts: a coattention encoder and a dynamic decoder.

CoAttention Encoder: The model first encodes the given document and question separately via the document and question encoder. The document and question encoders are essentially a one-directional LSTM network with one layer. Then it passes both the document and question encodings to another encoder which computes the coattention via matrix multiplications and outputs the coattention encoding from another bidirectional LSTM network.

Dynamic Decoder: Dynamic decoder is also a one-directional LSTM network with one layer. The model runs the LSTM network through several iterations. In each iteration, the LSTM takes in the final hidden state of the LSTM and the start and end word embeddings of the answer in the last iteration and outputs a new hidden state. Then, the model uses a Highway Maxout Network (HMN) to compute the new start and end word embeddings of the answer in each iteration.

dcn

Question Answering using Double-Cross-Attention

Double Cross Attention (DCA) seems to provide better results compared to both BiDAF and Dynamic Co-Attention Network (DCN). The motivation behind this approach is that first we pay attention to each context and question and then we attend those attentions with respect to each other in a slightly similar way as DCN. The intuition is that if iteratively read/attend both context and question, it should help us to search for answers easily.

I have augmented the Dynamic Decoder part from DCN model in-order to have iterative decoding process which helps finding better answer.

dca

Ranking Based Applications

Covid-19 Browser

Covid-19 Browser



There was a kaggle problem on covid-19 research challenge which has over 1,00,000 + documents. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.

The procedure I have taken is to convert the abstracts into a embedding representation using sentence-transformers. When a query is asked, it will converted into an embedding and then ranked across the abstracts using cosine similarity.

covid

Recommendation based Applications

Song Recommendation

Song Recommendation



By taking user’s listening queue as a sentence, with each word in that sentence being a song that the user has listened to, training the Word2vec model on those sentences essentially means that for each song the user has listened to in the past, we’re using the songs they have listened to before and after to teach our model that those songs somehow belong to the same context.

song_recom

What’s interesting about those vectors is that similar songs will have weights that are closer together than songs that are unrelated.