Instructor-led training at the GPU Technology Conference (GTC), taking place March 22-26, 2020 at the San Jose McEnery Convention Center in San Jose, California.
- Title: T22128: GPUs in Natural Language Processing
- Session Type: Instructor-Led Training
- Length: 1 Hour 45 Minutes
Note: Make sure to use RAPIDS v0.10 and a T4 or V100 GPU.
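Before running the notebooks, a quick sanity check of the environment can save time. This is a minimal sketch; it only confirms that the RAPIDS libraries import and report the expected versions (run `nvidia-smi` separately to confirm a T4 or V100 is visible):

```python
# Minimal environment check: confirm the RAPIDS libraries import
# and report their versions.
import cudf
import cuml

print("cuDF version:", cudf.__version__)  # expect 0.10.x
print("cuML version:", cuml.__version__)  # expect 0.10.x
```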
content/word_embeddings_sentiment_clustering.ipynb
content/bert_sentiment_clustering.ipynb
The main purpose of this tutorial is to target a particular Natural Language Processing (NLP) problem, in this case sentiment analysis, and show how GPUs can substantially speed up the workflow.
data/imdb_reviews_all_labeled.csv
- IMDB movie reviews sentiment dataset: a dataset for binary sentiment classification containing 25,000 highly polar movie reviews for training and 25,000 for testing. For this tutorial we combine the train and test splits into a total of 50,000 movie review texts with their negative/positive labels.
Notebook used for data prep:
data/data_prep.ipynb
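For reference, the combination step amounts to roughly the following sketch; the input file names are hypothetical, and the actual steps live in data/data_prep.ipynb:

```python
# Hypothetical sketch: merge the 25,000-review train and test splits into
# the single 50,000-review labeled file used by both notebooks.
import pandas as pd

train = pd.read_csv("data/imdb_train.csv")  # hypothetical input path
test = pd.read_csv("data/imdb_test.csv")    # hypothetical input path

combined = pd.concat([train, test], ignore_index=True)
combined.to_csv("data/imdb_reviews_all_labeled.csv", index=False)
print(combined.shape)  # expect (50000, 2): review text and 0/1 label
```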
In most cases, sentiment classifiers are used for binary classification (just positive or negative sentiment), because fine-grained sentiment classification is a significantly more challenging task.
The typical breakdown of fine-grained sentiment uses five discrete classes: strongly negative, negative, neutral, positive, and strongly positive. As one might imagine, models very easily err on either side of the strong/weak sentiment intensities thanks to the wonderful subtleties of human language.
Binary class labels may be sufficient for studying large-scale positive/negative sentiment trends in text data such as Tweets, product reviews or customer feedback, but they do have their limitations.
When performing information extraction with comparative expressions, for example:

- "This OnePlus model X is so much better than Samsung model X."

a fine-grained analysis can provide more precise results to an automated system that prioritizes addressing customer complaints. Likewise, dual-polarity sentences such as:

- "The location was truly disgusting ... but the people there were glorious."

can confuse binary sentiment classifiers, leading to incorrect class predictions.
content/word_embeddings_sentiment_clustering.ipynb
- Train custom word embeddings using a small neural network.
- Use LIME to explain the model's predictions.
- Use the embedding model to create review embeddings.
- Use the GPU to perform k-means clustering on all 50,000 movie reviews (see the sketch after this list).
- Find the best number of clusters k using the elbow method and the silhouette score.
- Run k-means with k=2 and plot the sentiments for both the predicted clusters and the true labels.
- Observe the overlap between the predicted labels and the true labels, and associate labels with clusters. Visualize the clusters.
- Try to find a third sentiment using k=3. Observe the overlap between the predicted and true labels. Visualize the clusters.
- Repeat the previous experiments with different values of k and observe how the predictions overlap with the true labels. Visualize the clusters.
- Visualize samples of text along with their predicted sentiments.
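The GPU clustering and model-selection steps could look roughly like the sketch below. It assumes `review_embeddings` is a (50000, d) float32 NumPy array produced by the embedding model, and a cuML version that returns NumPy arrays for NumPy inputs (recent releases do; older releases may return cuDF objects):

```python
import numpy as np
from cuml.cluster import KMeans          # GPU k-means, scikit-learn-style API
from sklearn.metrics import silhouette_score

# Sweep k for the elbow method (inertia) and the silhouette score.
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(review_embeddings)
    # Within-cluster sum of squares, computed manually for portability.
    inertia = float(((review_embeddings - km.cluster_centers_[labels]) ** 2).sum())
    # Silhouette on a subsample keeps the CPU-side metric tractable.
    sil = silhouette_score(review_embeddings, labels,
                           sample_size=5000, random_state=42)
    print(f"k={k}  inertia={inertia:.1f}  silhouette={sil:.3f}")
```

Plotting inertia against k gives the elbow curve; the silhouette score offers a second opinion on where the clustering structure is strongest.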
content/bert_sentiment_clustering.ipynb
- Use sentence embeddings from a pretrained state-of-the-art language model, in this case bert-base-nli-stsb-mean-tokens, to transform the text data into fixed-length vectors of 768 features, running model inference on the GPU (see the sketch after this list).
- Train a small neural network classifier and interpret the model.
- Use the GPU to perform k-means clustering on all 50,000 movie reviews.
- Find the best number of clusters k using the elbow method and the silhouette score.
- Run k-means with k=2 and plot the sentiments for both the predicted clusters and the true labels.
- Observe the overlap between the predicted labels and the true labels, and associate labels with clusters. Visualize the clusters.
- Try to find a third sentiment using k=3. Observe the overlap between the predicted and true labels. Visualize the clusters.
- Repeat the previous experiments with different values of k and observe how the predictions overlap with the true labels. Visualize the clusters.
- Visualize samples of text along with their predicted sentiments.
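The sentence-embedding step could look roughly like the following sketch using the sentence-transformers package, assuming `reviews` is the list of 50,000 review strings. The resulting 768-dimensional vectors then feed into the same cuML k-means workflow as in the first notebook:

```python
from sentence_transformers import SentenceTransformer

# Loads the pretrained model named above; inference runs on the GPU
# automatically when CUDA is available.
model = SentenceTransformer("bert-base-nli-stsb-mean-tokens")

# Encode all reviews into fixed-length 768-dimensional vectors.
review_embeddings = model.encode(reviews, batch_size=64, show_progress_bar=True)
print(len(review_embeddings), len(review_embeddings[0]))  # 50000 768
```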