
Hate Speech Detection | Data Mining (CSE-362) Project | IIT (BHU) Varanasi | Odd Semester 2020-21


Data-Mining-Project (CSE-362)

This repository contains source code for the Data Mining Project, IIT (BHU) Varanasi - Hate Speech Detection.

Guided By: Dr. Bhaskar Biswas, Associate Professor, CSE, IIT (BHU) Varanasi.

Group Members

Hate Speech / Toxic Comment detection

  • The project aims to improve the user experience on websites that host online chats, conversations and posts by flagging and removing text that contains hate and toxicity.
  • Given a text or short paragraph in a natural language (such as English), the objective is to classify it into one or more of the following categories: normal, obscene, threatening, insulting, toxic, severely toxic and hate.
  • This is a multi-class as well as a multi-label classification problem, since a post can be abusive in multiple ways. The model outputs the probability that the post belongs to each category and, based on a threshold (which can be tuned as a hyperparameter), the comment is assigned to a category or a set of categories.
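
The thresholding step described above can be sketched as follows. This is a minimal illustration, not the code from the notebooks: the function name `assign_labels`, the default threshold of 0.5, the example probabilities, and the choice to report "normal" when no other label reaches the threshold are all assumptions made for this sketch.

```python
from typing import Dict, List

# Category names taken from the list above. Treating "normal" as
# "no other label fired" is an assumption of this sketch.
LABELS = ["obscene", "threatening", "insulting", "toxic", "severely toxic", "hate"]


def assign_labels(probabilities: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every label whose predicted probability reaches the threshold."""
    chosen = [label for label in LABELS if probabilities.get(label, 0.0) >= threshold]
    # A comment with no label at or above the threshold is reported as "normal".
    return chosen or ["normal"]


# Hypothetical model output for one comment (illustrative numbers only).
example = {
    "obscene": 0.81,
    "threatening": 0.05,
    "insulting": 0.64,
    "toxic": 0.92,
    "severely toxic": 0.30,
    "hate": 0.12,
}

print(assign_labels(example))                  # ['obscene', 'insulting', 'toxic']
print(assign_labels(example, threshold=0.95))  # ['normal']
```

Lowering the threshold flags more comments at the cost of more false positives, which is why it is worth tuning on held-out data rather than fixing it in advance.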

Dataset

The dataset has been taken from Conversation AI.

It consists of three files:

  • Training Set (train.csv): Contains comments with their labels (0 or 1).
  • Test Set (test.csv): We are required to predict the labels of these comments.
  • Labels for test data (test_labels.csv): To evaluate our predictions on the test set.
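
As a rough sketch of how a file in this layout could be read, the snippet below parses comments and their 0/1 labels with the standard `csv` module. The column names (`comment_text`, `toxic`, `obscene`), the helper `load_comments`, and the two sample rows are assumptions for illustration and are not taken from the dataset; the notebooks may load the data differently.

```python
import csv
import io

# Tiny stand-in for train.csv. Column names and rows are hypothetical.
SAMPLE_CSV = """id,comment_text,toxic,obscene
a1,"Thanks for the helpful edit.",0,0
a2,"An example of an abusive comment.",1,1
"""


def load_comments(handle, label_columns):
    """Read comment texts and their 0/1 labels from an open CSV handle."""
    texts, labels = [], []
    for row in csv.DictReader(handle):
        texts.append(row["comment_text"])
        # One 0/1 flag per label column, in the order given by label_columns.
        labels.append([int(row[column]) for column in label_columns])
    return texts, labels


texts, labels = load_comments(io.StringIO(SAMPLE_CSV), ["toxic", "obscene"])
print(texts)
print(labels)
```

For a file on disk, the same function can be given `open("train.csv", newline="", encoding="utf-8")` in place of the in-memory sample.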

Embeddings

The download links of the pretrained embeddings used in the model:

File Structure

The code is written in .ipynb files, which contain both the code and their outputs:

Results

We used the AUC_ROC score (area under the ROC curve) to evaluate the performance of the models. These are the results:

| Model | Mean AUC_ROC Score |
|---|---|
| Support Vector Machines (Binary Relevance) | 0.66 |
| Support Vector Machines (Classifier Chains) | 0.67 |
| Logistic Regression (Binary Relevance) | 0.73 |
| Logistic Regression (Classifier Chains) | 0.76 |
| Extra Trees | 0.93 |
| XGBoost | 0.96 |
| LSTM without pretrained embeddings | 0.97 |
| LSTM with FastText embedding | 0.96 |
| LSTM with GloVe embedding | 0.88 |
| LSTM with Word2Vec embedding | 0.85 |
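
One plausible reading of the mean AUC_ROC score is the ROC AUC computed separately for each label and then averaged over the labels. The sketch below illustrates that reading in plain Python, using the fact that ROC AUC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The function names and toy numbers are assumptions for illustration; the notebooks may instead rely on a library routine such as scikit-learn's `roc_auc_score`.

```python
from typing import Dict, Sequence


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


def mean_roc_auc(y_true: Dict[str, Sequence[int]], y_score: Dict[str, Sequence[float]]) -> float:
    """Average the per-label ROC AUC over all labels."""
    return sum(roc_auc(y_true[label], y_score[label]) for label in y_true) / len(y_true)


# Toy ground truth and predicted probabilities for two labels (illustrative only).
truth = {"toxic": [1, 0, 1, 0], "obscene": [0, 0, 1, 1]}
scores = {"toxic": [0.9, 0.2, 0.6, 0.4], "obscene": [0.1, 0.7, 0.8, 0.3]}

print(roc_auc(truth["toxic"], scores["toxic"]))      # 1.0
print(roc_auc(truth["obscene"], scores["obscene"]))  # 0.75
print(mean_roc_auc(truth, scores))                   # 0.875
```

The pairwise loop is quadratic in the number of examples, so it is only suitable for illustration; a sort-based implementation or a library routine is the practical choice for a full test set.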