Data-Mining-Project (CSE-362)

This repository contains source code for the Data Mining Project, IIT (BHU) Varanasi - Hate Speech Detection.

Guided By: Dr. Bhaskar Biswas, Associate Professor, CSE, IIT (BHU) Varanasi.

The project aims to improve the experience of online chats, conversations and posts on any website by flagging and removing text that contains hate and toxicity.
Given a text or short paragraph in a natural language (such as English), the objective is to classify it into one or more of the following categories: normal, obscene, threatening, insulting, toxic, severely toxic and hate.
The problem involves several classes and is also a multi-label classification problem, since a post can be abusive in multiple ways and so belong to more than one category. The model outputs the probability that the post belongs to each category, and a comment is assigned to every category whose probability reaches a certain threshold (which can be tuned as a hyperparameter).
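The thresholding step described above can be sketched as follows. This is a minimal illustration, not code from the notebooks: the function name `assign_labels`, the default threshold of 0.5 and the example probabilities are all assumptions made for the example.

```python
# Category names as listed in this README; the probabilities below are made up.
CATEGORIES = ["normal", "obscene", "threatening", "insulting",
              "toxic", "severely toxic", "hate"]


def assign_labels(probabilities, threshold=0.5):
    """Return every category whose predicted probability reaches the threshold."""
    return [name for name, p in zip(CATEGORIES, probabilities) if p >= threshold]


# Hypothetical model output for one comment, one probability per category.
example_probs = [0.05, 0.81, 0.10, 0.64, 0.92, 0.30, 0.12]

print(assign_labels(example_probs))        # ['obscene', 'insulting', 'toxic']
print(assign_labels(example_probs, 0.7))   # ['obscene', 'toxic']
```

Raising the threshold trades recall for precision, which is why it is reasonable to tune it on held-out data rather than fix it in advance.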

The dataset has been taken from Conversation AI.

It consists of three files:

Training Set (train.csv): Contains comments with their labels (0 or 1).
Test Set (test.csv): We are required to predict the labels of these comments.
Labels for test data (test_labels.csv): To evaluate our predictions on the test set.

The download links of the pretrained embeddings used in the model:

The code is written in .ipynb notebooks, which contain both the code and its outputs:

Data Visualization (Visualisation.ipynb)
SVM - Binary Relevance and Classifier Chains (SupportVectorMachine.ipynb)
Logistic Regression - Binary Relevance and Classifier Chains (LogisticRegression.ipynb)
Extra Trees (ExtraTrees.ipynb)
XGBoost (XGBoost.ipynb)
LSTM without pretrained embeddings (LSTM_without.ipynb)
LSTM with FastText embedding (LSTM_fasttext.ipynb)
LSTM with GloVe embedding (LSTM_glove.ipynb)
LSTM with Word2Vec embedding (LSTM_word2vec.ipynb)
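The SVM and Logistic Regression notebooks above each compare Binary Relevance with Classifier Chains. The difference between the two strategies can be sketched with scikit-learn as below. This is an assumption-laden toy example, not the notebooks' code: the texts, the two label columns, the TF-IDF features and the chain order are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain, MultiOutputClassifier

# Toy data (hypothetical): columns of Y are [toxic, insulting].
texts = [
    "have a nice day",
    "thanks for the help",
    "you are an idiot",
    "shut up you fool",
    "this is garbage and so are you",
    "great work, well done",
]
Y = np.array([[0, 0], [0, 0], [1, 1], [1, 1], [1, 0], [0, 0]])

# Dense features only because the toy data is tiny.
X = TfidfVectorizer().fit_transform(texts).toarray()

# Binary Relevance: one independent binary classifier per label.
binary_relevance = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

# Classifier Chain: each classifier also sees the labels earlier in the chain.
chain = ClassifierChain(LogisticRegression(max_iter=1000), order=[0, 1]).fit(X, Y)

# predict_proba of MultiOutputClassifier returns one (n_samples, 2) array per label;
# keep the positive-class column of each.
br_probs = np.column_stack([p[:, 1] for p in binary_relevance.predict_proba(X)])
chain_probs = chain.predict_proba(X)  # already shaped (n_samples, n_labels)

print(br_probs.shape, chain_probs.shape)
```

Binary Relevance ignores any correlation between labels, while a chain can exploit it (for example, an insulting comment is usually also toxic) at the cost of depending on the chosen label order.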

We used the AUC_ROC score (area under the ROC curve) to evaluate the performance of the models. The results are:

Model	Mean AUC_ROC Score
Support Vector Machines (Binary Relevance)	0.66
Support Vector Machines (Classifier Chains)	0.67
Logistic Regression (Binary Relevance)	0.73
Logistic Regression (Classifier Chains)	0.76
Extra Trees	0.93
XGBoost	0.96
LSTM without pretrained embeddings	0.97
LSTM with FastText embedding	0.96
LSTM with GloVe embedding	0.88
LSTM with Word2Vec embedding	0.85
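For reference, a mean AUC_ROC score of this kind can be computed as sketched below, assuming that "mean" refers to an unweighted average of the per-label scores (the README does not spell this out). The helper names `roc_auc` and `mean_roc_auc` and the toy numbers are hypothetical. The pairwise formula is written out in plain Python to show what the metric measures; in practice `sklearn.metrics.roc_auc_score` would normally be used instead.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_roc_auc(Y_true, Y_score):
    """Unweighted average of the per-label ROC AUC over all label columns."""
    n_labels = len(Y_true[0])
    aucs = [
        roc_auc([row[j] for row in Y_true], [row[j] for row in Y_score])
        for j in range(n_labels)
    ]
    return sum(aucs) / n_labels


# Toy example with two labels and four comments (made-up values).
Y_true = [[1, 0], [0, 0], [1, 1], [0, 1]]
Y_score = [[0.9, 0.2], [0.3, 0.4], [0.6, 0.7], [0.7, 0.8]]

print(mean_roc_auc(Y_true, Y_score))  # 0.875 (per-label scores 0.75 and 1.0)
```

Because the score depends only on how examples are ranked, it is unaffected by the classification threshold, which makes it convenient for comparing models before a threshold is chosen.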
