Summary
- Got a Twitter dataset from Kaggle
- Cleaned the data using the tweet-preprocessor library and the regular expression library
- Split the data into training and test sets in a 70/30 ratio
- Vectorized the tweets using scikit-learn's CountVectorizer
- Built a model using a Support Vector Classifier (SVC)
- Achieved 95% accuracy
Data Collection
- Link to the dataset: https://www.kaggle.com/arkhoshghalb/twitter-sentiment-analysis-hatred-speech
- Understanding the dataset
- Let's read the context of the dataset to understand the problem statement.
- In the training data, tweets are labeled '1' if they are associated with racist or sexist sentiment; otherwise, they are labeled '0'.
- Downloading the dataset
- Now that you have an understanding of the dataset, go ahead and download the two CSV files - the training and the test data. Simply click "Download (5MB)."
Data Exploration (Exploratory Data Analysis)
- Let's check what the training and the test data look like.
- Notice that the tweets contain special characters like @, #, and !. We will remove these characters later in the data cleaning step.
- Check whether there are any missing values. There were no missing values in either the training or the test data; see the loading sketch below.
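- A minimal sketch of loading and inspecting the data with pandas. The file names train.csv and test.csv are assumptions based on the Kaggle download:

```python
import pandas as pd

# File names are assumed from the Kaggle download; adjust the paths if yours differ.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Peek at the raw tweets (and labels, for the training data).
print(train.head())
print(test.head())

# Count missing values per column; both files come back clean for this dataset.
print(train.isnull().sum())
print(test.isnull().sum())
```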
Data Cleaning
- We will clean the data using the tweet-preprocessor library: https://pypi.org/project/tweet-preprocessor/
- This library removes URLs, hashtags, mentions, reserved words (RT, FAV), emojis, and smileys.
- We will also use the regular expression library (re) to remove special characters that tweet-preprocessor does not handle; a cleaning sketch follows below.
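- A sketch of the cleaning step, continuing from the loading code above. The column name tweet and the exact regular expression are assumptions; the idea is to let tweet-preprocessor strip Twitter-specific tokens and then use re to drop leftover special characters:

```python
import re
import preprocessor as p  # the tweet-preprocessor library

def clean_tweet(tweet):
    # Remove URLs, hashtags, mentions, reserved words (RT, FAV), emojis, and smileys.
    tweet = p.clean(tweet)
    # Drop any remaining special characters and digits (illustrative pattern).
    tweet = re.sub(r"[^a-zA-Z\s]", "", tweet)
    return tweet.lower().strip()

# The column name "tweet" is an assumption about the Kaggle CSVs.
train["clean_tweet"] = train["tweet"].apply(clean_tweet)
test["clean_tweet"] = test["tweet"].apply(clean_tweet)
```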
Train and Test Split
- Now that we have cleaned our data, we will split it into training and test sets using scikit-learn's train_test_split function.
- We will use 70% of the data for training and hold out the remaining 30% for testing, as in the sketch below.
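- A minimal sketch of the split, assuming it is applied to the labeled training data; the column names clean_tweet and label and the random_state are assumptions:

```python
from sklearn.model_selection import train_test_split

# 70% of the labeled tweets for training, 30% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    train["clean_tweet"], train["label"], test_size=0.3, random_state=42
)
```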
Vectorize Tweets Using CountVectorizer
- Now we will convert the text into numeric form, since our model cannot work with raw text. We will vectorize the tweets using CountVectorizer, which provides a simple way to both tokenize a collection of text documents and build a vocabulary of known words.
- For example, let's say we have a list of text documents like the one below. document = ["This is Import Data's YouTube channel", "Data Science is my passion and it is fun", "Please subscribe to my channel"]
- CountVectorizer tokenizes all the documents and builds a single vocabulary across them, then counts how many times each word appears in each document. The results are shown below.
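- Here is roughly what that looks like with scikit-learn's CountVectorizer (on scikit-learn 1.0+; older versions use get_feature_names instead of get_feature_names_out):

```python
from sklearn.feature_extraction.text import CountVectorizer

document = [
    "This is Import Data's YouTube channel",
    "Data Science is my passion and it is fun",
    "Please subscribe to my channel",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(document)  # tokenize and count word occurrences

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())                    # word counts per document
```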
Model Building
- Now that we have vectorized all the tweets, we will build a model to classify the test data.
- We will use a supervised learning algorithm, the Support Vector Classifier (SVC). It is widely used for both binary and multi-class classification; a short sketch follows after this list.
- You can find more explanation on the scikit-learn documentation page: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
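- A minimal sketch of vectorizing the cleaned tweets and fitting the classifier, continuing from the split above. Default SVC hyperparameters are an assumption; the original model may tune the kernel or C:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Fit the vectorizer on the training tweets only, then transform both splits.
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train the Support Vector Classifier and classify the held-out 30%.
model = SVC()
model.fit(X_train_vec, y_train)
predictions = model.predict(X_test_vec)
```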
Accuracy