/gtc2020_instructor_training

GPUs in Natural Language Processing




Intro Slides

Instructor-Led Training at the GPU Technology Conference (GTC), taking place March 22-26, 2020 at the San Jose McEnery Convention Center in San Jose, California.


  • Title: T22128: GPUs in Natural Language Processing
  • Session Type: Instructor-Led Training
  • Length: 1 Hour 45 Minutes

Run the experiments using:

Note: Make sure to use RAPIDS v0.10 and a T4 or V100 GPU.

  • content/word_embeddings_sentiment_clustering.ipynb
  • content/bert_sentiment_clustering.ipynb

The main purpose of this tutorial is to target a particular Natural Language Processing (NLP) problem, in this case sentiment analysis, and use GPUs to achieve significant speedups.


Dataset used:

data/imdb_reviews_all_labeled.csv

  • IMDB movie reviews sentiment dataset: This is a dataset for binary sentiment classification containing 25,000 highly polar movie reviews for training and 25,000 for testing. For this tutorial we combine the train and test splits for a total of 50,000 movie reviews and their negative/positive labels.

Notebook used for data prep: data/data_prep.ipynb
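The combining step described above can be sketched with pandas. This is illustrative only (the actual preparation lives in data/data_prep.ipynb, and the tiny DataFrames here stand in for the real train/test splits):

```python
import pandas as pd

# Stand-ins for the 25,000-review train and test splits.
train = pd.DataFrame({"review": ["great film", "awful plot"], "label": [1, 0]})
test = pd.DataFrame({"review": ["loved it"], "label": [1]})

# Combine both splits into one labeled dataset, as the tutorial does.
all_reviews = pd.concat([train, test], ignore_index=True)
# all_reviews.to_csv("data/imdb_reviews_all_labeled.csv", index=False)
```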


NLP - Fine-grained Sentiment Analysis

In most cases, sentiment classifiers perform binary classification (just positive or negative sentiment). That is because fine-grained sentiment classification is a significantly more challenging task!

The typical breakdown of fine-grained sentiment uses five discrete classes, as shown below. As one might imagine, models very easily err on either side of the strong/weak sentiment intensities thanks to the wonderful subtleties of human language.

[Figure: the five-class fine-grained sentiment scale, from strongly negative to strongly positive]

Binary class labels may be sufficient for studying large-scale positive/negative sentiment trends in text data such as Tweets, product reviews or customer feedback, but they do have their limitations.

Two cases where fine-grained analysis pays off:

  • Comparative expressions: “This OnePlus model X is so much better than Samsung model X.” When performing information extraction on sentences like this, a fine-grained analysis can provide more precise results to an automated system that prioritizes addressing customer complaints.

  • Dual polarity: “The location was truly disgusting ... but the people there were glorious.” Sentences like this can confuse binary sentiment classifiers, leading to incorrect class predictions.

source
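To make the binary-vs-fine-grained distinction concrete, here is a minimal sketch of how a five-class scale collapses into binary labels. The class names are illustrative (this repository does not define them; they follow the common SST-style breakdown):

```python
# Illustrative five-class fine-grained sentiment scale.
FINE_GRAINED = ["very negative", "negative", "neutral", "positive", "very positive"]

def to_binary(fine_label: str) -> str:
    """Collapse a fine-grained label to binary polarity.

    Note how the strong/weak distinction is lost, and neutral has no
    binary home at all (binary datasets typically discard such examples).
    """
    idx = FINE_GRAINED.index(fine_label)
    if idx < 2:
        return "negative"
    if idx > 2:
        return "positive"
    return "dropped"

print(to_binary("very negative"))  # strong and weak negatives collapse together
```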


Main notebooks:

Word Embeddings Sentiment Clustering

content/word_embeddings_sentiment_clustering.ipynb

Run in Google Colab

Content:

  • Train custom word embeddings using a small neural network.
  • Use LIME to explain model predictions.
  • Use the embedding model to create review embeddings.
  • Use the GPU to perform k-means clustering on all 50,000 movie reviews.
  • Find the best number of clusters k using the elbow method and the silhouette score.
  • Run k-means with k=2 and plot the sentiments for both the predicted clusters and the true labels.
  • Observe the overlap between the predicted labels and the true labels, and associate labels with clusters. Visualize the clusters.
  • Try to find a third sentiment using k=3. Observe the overlap between predicted labels and true labels. Visualize the clusters.
  • Repeat the previous experiments with other values of k and observe how the predictions overlap with the true labels. Visualize the clusters.
  • Visualize samples of text predicted with various sentiments.
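The clustering and k-selection steps above can be sketched as follows. The notebook runs k-means on the GPU with RAPIDS cuML; here scikit-learn's CPU implementation stands in, and random blobs stand in for the review embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Stand-in for the 50,000 review embeddings: two well-separated blobs.
X = np.vstack([rng.normal(-2, 1, (100, 16)), rng.normal(2, 1, (100, 16))])

inertias, silhouettes = {}, {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                      # elbow method: look for the bend
    silhouettes[k] = silhouette_score(X, km.labels_)  # higher is better

best_k = max(silhouettes, key=silhouettes.get)
```

With the real embeddings, k=2 recovers the positive/negative split, and the cluster labels can then be compared against the true labels exactly as the notebook does.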

BERT Sentiment Clustering

content/bert_sentiment_clustering.ipynb

Run in Google Colab

Content:

  • Use sentence embeddings from a pretrained state-of-the-art language model, in this case bert-base-nli-stsb-mean-tokens, to transform the text data into fixed-length feature vectors of 768 features. Model inference is performed on the GPU.
  • Train a small neural network and interpret the model.
  • Use the GPU to perform k-means clustering on all 50,000 movie reviews.
  • Find the best number of clusters k using the elbow method and the silhouette score.
  • Run k-means with k=2 and plot the sentiments for both the predicted clusters and the true labels.
  • Observe the overlap between the predicted labels and the true labels, and associate labels with clusters. Visualize the clusters.
  • Try to find a third sentiment using k=3. Observe the overlap between predicted labels and true labels. Visualize the clusters.
  • Repeat the previous experiments with other values of k and observe how the predictions overlap with the true labels. Visualize the clusters.
  • Visualize samples of text predicted with various sentiments.
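The "mean-tokens" part of bert-base-nli-stsb-mean-tokens refers to mean pooling of the token embeddings into one sentence vector. A minimal NumPy sketch of that pooling step (the notebook obtains the actual token embeddings from the pretrained model; the tiny arrays here are illustrative, with hidden size 2 instead of 768):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions.

    token_embeddings: (seq_len, hidden), e.g. hidden = 768 for BERT-base
    attention_mask:   (seq_len,) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # sum over real tokens only
    counts = mask.sum()                             # number of real tokens
    return summed / counts

tokens = np.array([[1.0, 3.0], [3.0, 5.0], [99.0, 99.0]])  # last row is padding
mask = np.array([1, 1, 0])
sentence_vec = mean_pool(tokens, mask)  # -> [2.0, 4.0]
```

Pooling this way gives every review a fixed-length vector regardless of its token count, which is what makes the downstream k-means clustering possible.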