This repository contains source code for the Data Mining Project, IIT (BHU) Varanasi - Hate Speech Detection.
Guided By: Dr. Bhaskar Biswas, Associate Professor, CSE, IIT (BHU) Varanasi.
- Harshit Agrawal - 18074019 (CSE IDD Part 3)
- Ashish Kumar - 18075068 (CSE B.Tech. Part 3)
- Sachin Srivastava - 18075070 (CSE B.Tech. Part 3)
- The project aims at improving the user experience of using any website for online chats, conversation and posts by flagging and removing the textual material containing hate and toxicity.
- Given any text or paragraph containing a few lines in natural language (such as English), the objective is to classify it as belonging to one of the following categories:- normal, obscene, threatening, insulting, toxic, severely toxic and hate.
- This is a multi-class classification problem as well as a multi-label classification problem, since a post can be abusive in multiple ways. The model will output the probability of the post belonging to each of the categories and based on a certain threshold (which can be tuned as a hyperparameter), a comment may be classified to be belonging to a category/set of categories
The dataset has been taken from Conversation AI.
It consists of three files:
- Training Set (train.csv): Contains comments with their labels (0 or 1).
- Test Set (test.csv): We are required to predict the labels of these comments.
- Labels for test data (test_labels.csv): To evaluate our predictions on the test set.
The download links of the pretrained embeddings used in the model:
- Project Description (Project_Description.pdf)
- Presentation (Presentation.pdf)
- Report (Report.pdf)
The code is written in .ipynb files, which contain both the code and their outputs:
- Data Visualization (Visualisation.ipynb)
- SVM - Binary Relevance and Classifier Chains (SupportVectorMachine.ipynb)
- Logistic Regression - Binary Relevance and Classifier Chains (LogisticRegression.ipynb)
- Extra Trees (ExtraTrees.ipynb)
- XGBoost (XGBoost.ipynb)
- LSTM without pretrained embeddings (LSTM_without.ipynb)
- LSTM with FastText embedding (LSTM_fasttext.ipynb)
- LSTM with Glove embedding (LSTM_glove.ipynb)
- LSTM with Word2Vec embedding (LSTM_word2vec.ipynb)
We have used AUC_ROC Score to evaluate the performance of the models. These are the results:
Model | Mean AUC_ROC Score |
---|---|
Support Vector Machines (Binary Relevance) | 0.66 |
Support Vector Machines (Classifier Chains) | 0.67 |
Logistic Regression (Binary Relevance) | 0.73 |
Logistic Regression (Classifier Chains) | 0.76 |
Extra Trees | 0.93 |
XGBoost | 0.96 |
LSTM without pretrained embeddings | 0.97 |
LSTM with FastText embedding | 0.96 |
LSTM with Glove embedding | 0.88 |
LSTM with Word2Vec embedding | 0.85 |