A repository
Our project analyzed a dataset CSV file from Kaggle containing 31,935 tweets. The dataset was heavily skewed with 93% of tweets or 29,695 tweets containing non-hate labeled Twitter data and 7% or 2,240 tweets containing hate-labeled Twitter data.
- The first step of building our model was to balance the number of hate and non-hate tweets.
- We clean the tweets by employing lemmatization, stemming, removal of stop words, and omissions.
- Then for the preprocessing step, we use Bag of words and Term Frequency Inverse Document Frequency (TFIDF).
- For both the Bag of words and TFIDF, we run 5 classification algorithms, namely Logistic Regression, Naive Bayes, Decision Tree, Random Forest and Gradient Boosting.
- These 5 algorithms are ran again after performing dimensionality reduction for both TF-IDF and Bag of Words.
The data cleaning.ipynb contains the code for cleaning the tweets. Final working code.ipynb file contains the code for model building and visualisations.