Twitter Sentiment Analysis
The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to distinguish racist or sexist tweets from other tweets.
Formally, given a training sample of tweets and labels, where label '1' denotes the tweet is racist/sexist and label '0' denotes the tweet is not racist/sexist, your objective is to predict the labels on the test dataset.
- Understanding the Problem Statement
- Tweets Preprocessing and Cleaning
- Data Inspection
- Data Cleaning
- Story Generation and Visualization from Tweets
- Extracting Features from Cleaned Tweets
- Bag-of-Words
- TF-IDF
- Word Embeddings
- Model Building: Sentiment Analysis
- Logistic Regression
- Support Vector Machine
- Random Forest
- XGBoost
- Model Fine-tuning
- Summary
The goal is to classify a set of tweets into two categories:
- racist/sexist
- non-racist/sexist
Data Files
train.csv - For training the models, we provide a labelled dataset of 31,962 tweets. The dataset is provided in the form of a CSV file, with each line storing a tweet id, its label, and the tweet.
There is one test file (public).
test_tweets.csv - The test data file contains only tweet ids and the tweet text with each tweet in a new line.
Evaluation Metric:
The metric used for evaluating the performance of the classification model is the F1 score.
The metric can be understood as follows:
True Positives (TP) - Correctly predicted positive values, i.e. the actual class is yes and the predicted class is also yes.
True Negatives (TN) - Correctly predicted negative values, i.e. the actual class is no and the predicted class is also no.
False Positives (FP) - The actual class is no but the predicted class is yes.
False Negatives (FN) - The actual class is yes but the predicted class is no.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
F1 is usually more useful than accuracy, especially when there is an uneven class distribution.