/gtc2020_instructor_training

GPUs in Natural Language Processing




Intro Slides

Instructor-Led Training at the GPU Technology Conference (GTC), taking place March 22-26, 2020 at the San Jose McEnery Convention Center in San Jose, California.


  • Title: T22128: GPUs in Natural Language Processing
  • Session Type: Instructor-Led Training
  • Length: 1 Hour 45 Minutes

Run the experiments using:

Note: Make sure to use RAPIDS v0.10 and a T4 or V100 GPU.

  • content/word_embeddings_sentiment_clustering.ipynb
  • content/bert_sentiment_clustering.ipynb

The main purpose of this tutorial is to target a particular Natural Language Processing (NLP) problem, in this case sentiment analysis, and use GPUs to achieve significant speedups.


Dataset used:

data/imdb_reviews_all_labeled.csv

  • IMDB movie reviews sentiment dataset: This is a dataset for binary sentiment classification containing 25,000 highly polar movie reviews for training and 25,000 for testing. For this tutorial we combine the train and test splits for a total of 50,000 movie reviews and their negative/positive labels.

Notebook used for data prep: data/data_prep.ipynb
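The combining step described above can be sketched with pandas. This is illustrative only (the actual preparation lives in data/data_prep.ipynb, and the tiny DataFrames here stand in for the real train/test splits):

```python
import pandas as pd

# Stand-ins for the 25,000-review train and test splits.
train = pd.DataFrame({"review": ["great film", "awful plot"], "label": [1, 0]})
test = pd.DataFrame({"review": ["loved it"], "label": [1]})

# Combine both splits into one labeled dataset, as the tutorial does.
all_reviews = pd.concat([train, test], ignore_index=True)
# all_reviews.to_csv("data/imdb_reviews_all_labeled.csv", index=False)
```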


NLP - Fine-grained Sentiment Analysis

In most cases, sentiment classifiers perform binary classification (just positive or negative sentiment). That is because fine-grained sentiment classification is a significantly more challenging task!

The typical breakdown of fine-grained sentiment uses five discrete classes, as shown below. As one might imagine, models very easily err on either side of the strong/weak sentiment intensities thanks to the wonderful subtleties of human language.

[Figure: the five-class fine-grained sentiment scale, from strongly negative to strongly positive]

Binary class labels may be sufficient for studying large-scale positive/negative sentiment trends in text data such as Tweets, product reviews or customer feedback, but they do have their limitations.

Two cases where fine-grained analysis pays off:

  • Comparative expressions: “This OnePlus model X is so much better than Samsung model X.” When performing information extraction on sentences like this, a fine-grained analysis can provide more precise results to an automated system that prioritizes addressing customer complaints.

  • Dual polarity: “The location was truly disgusting ... but the people there were glorious.” Sentences like this can confuse binary sentiment classifiers, leading to incorrect class predictions.

source
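To make the binary-vs-fine-grained distinction concrete, here is a minimal sketch of how a five-class scale collapses into binary labels. The class names are illustrative (this repository does not define them; they follow the common SST-style breakdown):

```python
# Illustrative five-class fine-grained sentiment scale.
FINE_GRAINED = ["very negative", "negative", "neutral", "positive", "very positive"]

def to_binary(fine_label: str) -> str:
    """Collapse a fine-grained label to binary polarity.

    Note how the strong/weak distinction is lost, and neutral has no
    binary home at all (binary datasets typically discard such examples).
    """
    idx = FINE_GRAINED.index(fine_label)
    if idx < 2:
        return "negative"
    if idx > 2:
        return "positive"
    return "dropped"

print(to_binary("very negative"))  # strong and weak negatives collapse together
```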


Main notebooks:

Word Embeddings Sentiment Clustering

content/word_embeddings_sentiment_clustering.ipynb

Run in Google Colab

Content:

  • Train custom word embeddings using a small neural network.
  • Use LIME to explain model predictions.
  • Use the embedding model to create review embeddings.
  • Use the GPU to perform k-means clustering on all 50,000 movie reviews.
  • Find the best number of clusters k using the elbow method and the silhouette score.
  • Run k-means with k=2 and plot the sentiments for both the predicted clusters and the true labels.
  • Observe the overlap between the predicted labels and the true labels, and associate labels with clusters. Visualize the clusters.
  • Try to find a third sentiment using k=3. Observe the overlap between predicted labels and true labels. Visualize the clusters.
  • Repeat the previous experiments with other values of k and observe how the predictions overlap with the true labels. Visualize the clusters.
  • Visualize samples of text predicted with various sentiments.
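The clustering and k-selection steps above can be sketched as follows. The notebook runs k-means on the GPU with RAPIDS cuML; here scikit-learn's CPU implementation stands in, and random blobs stand in for the review embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Stand-in for the 50,000 review embeddings: two well-separated blobs.
X = np.vstack([rng.normal(-2, 1, (100, 16)), rng.normal(2, 1, (100, 16))])

inertias, silhouettes = {}, {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                      # elbow method: look for the bend
    silhouettes[k] = silhouette_score(X, km.labels_)  # higher is better

best_k = max(silhouettes, key=silhouettes.get)
```

With the real embeddings, k=2 recovers the positive/negative split, and the cluster labels can then be compared against the true labels exactly as the notebook does.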

BERT Sentiment Clustering

content/bert_sentiment_clustering.ipynb

Run in Google Colab

Content:

  • Use sentence embeddings from a pretrained state-of-the-art language model, in this case bert-base-nli-stsb-mean-tokens, to transform the text data into fixed-length feature vectors of 768 features. Model inference is performed on the GPU.
  • Train a small neural network and interpret the model.
  • Use the GPU to perform k-means clustering on all 50,000 movie reviews.
  • Find the best number of clusters k using the elbow method and the silhouette score.
  • Run k-means with k=2 and plot the sentiments for both the predicted clusters and the true labels.
  • Observe the overlap between the predicted labels and the true labels, and associate labels with clusters. Visualize the clusters.
  • Try to find a third sentiment using k=3. Observe the overlap between predicted labels and true labels. Visualize the clusters.
  • Repeat the previous experiments with other values of k and observe how the predictions overlap with the true labels. Visualize the clusters.
  • Visualize samples of text predicted with various sentiments.
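The "mean-tokens" part of bert-base-nli-stsb-mean-tokens refers to mean pooling of the token embeddings into one sentence vector. A minimal NumPy sketch of that pooling step (the notebook obtains the actual token embeddings from the pretrained model; the tiny arrays here are illustrative, with hidden size 2 instead of 768):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions.

    token_embeddings: (seq_len, hidden), e.g. hidden = 768 for BERT-base
    attention_mask:   (seq_len,) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # sum over real tokens only
    counts = mask.sum()                             # number of real tokens
    return summed / counts

tokens = np.array([[1.0, 3.0], [3.0, 5.0], [99.0, 99.0]])  # last row is padding
mask = np.array([1, 1, 0])
sentence_vec = mean_pool(tokens, mask)  # -> [2.0, 4.0]
```

Pooling this way gives every review a fixed-length vector regardless of its token count, which is what makes the downstream k-means clustering possible.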