kaggle-toxic-comments-classification

Kaggle Competition involving a Natural Language Processing task

Toxic Comments Classification

Aim:

In a recent Kaggle competition, participants were invited to explore statistical techniques that would improve the detection abilities of the Perspective tool. The dataset consisted of comments from Wikipedia talk page edits that human raters had labeled for the following types of toxicity: ‘toxic’, ‘severe toxic’, ‘obscene’, ‘threat’, ‘insult’ and ‘identity hate’. In a supervised learning framework, participants were asked to build a model that estimates, for each comment, the probability that it falls under each type of toxicity.
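The six labels make this a multilabel problem: every comment carries six independent binary targets, and a single comment may fall under several categories at once. A minimal sketch of the setup, assuming the competition's training set is available locally as `train.csv` (column names here follow the label types listed above, with underscores):

```python
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# train.csv is an assumed local path to the Kaggle training set.
train = pd.read_csv("train.csv")

# Each row is one comment with six independent 0/1 labels.
print(train[["comment_text"] + LABELS].head())
print(train[LABELS].mean())  # per-label positive rate
```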

Our project aims to help improve online conversation by exploring various Machine Learning, Deep Learning and NLP methods to build an accurate model capable of detecting the diverse types of online comments perceived as toxic.

The project will broadly follow these steps:

  1. Data Preprocessing (code sketch below):
  • Noise Removal
  • Lexicon Normalization: Lemmatization and Stemming
  2. Feature Extraction (code sketch below):
  • Statistical features
  • TF-IDF
  3. Model Building (code sketch below):
  • Machine Learning: Logistic Regression, Naive Bayes, Random Forest, XGBoost
  • Deep Learning: LSTM
  4. Model Evaluation (code sketch below):
  • ROC curves
  • Hamming score for multilabel evaluation
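A rough sketch of step 1 using NLTK. The regex patterns, stopword list and the choice between lemmatization and stemming are illustrative assumptions, not the project's exact pipeline:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def clean_comment(text: str, use_stemming: bool = False) -> str:
    # Noise removal: lowercase, strip URLs, IP addresses and non-letters.
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)
    text = re.sub(r"\d{1,3}(?:\.\d{1,3}){3}", " ", text)  # IPs common in talk-page dumps
    text = re.sub(r"[^a-z\s]", " ", text)

    tokens = [t for t in text.split() if t not in STOPWORDS]
    # Lexicon normalization: reduce tokens to a base form, either by
    # dictionary lookup (lemmatization) or suffix stripping (stemming).
    if use_stemming:
        return " ".join(stemmer.stem(t) for t in tokens)
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)

print(clean_comment("You're being SO obnoxious!! See http://example.com"))
```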
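For step 2, a sketch combining a few hand-crafted statistical features with word-level TF-IDF; the specific features and vectorizer parameters are placeholders:

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

comments = pd.Series([
    "you are a total idiot",
    "Thanks for the helpful edit!",
])

# Statistical features: simple counts that often correlate with toxicity.
stats = pd.DataFrame({
    "length": comments.str.len(),
    "n_words": comments.str.split().str.len(),
    "n_exclaim": comments.str.count("!"),
    "caps_ratio": comments.str.count(r"[A-Z]") / comments.str.len().clip(lower=1),
})

# TF-IDF: weight each term by its frequency in the comment, discounted
# by how common the term is across the whole corpus.
tfidf = TfidfVectorizer(max_features=20000, ngram_range=(1, 2), sublinear_tf=True)
X_text = tfidf.fit_transform(comments)

X = hstack([X_text, stats.values])  # combined sparse feature matrix
print(X.shape)
```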
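For step 3, a minimal Machine Learning baseline: one logistic regression per label (one-vs-rest) on TF-IDF features, which directly yields the per-label probabilities the competition asks for. Hyperparameters and the `train.csv` path are assumptions:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

train = pd.read_csv("train.csv")  # assumed local path to the Kaggle training set
X_train, X_val, y_train, y_val = train_test_split(
    train["comment_text"], train[LABELS], test_size=0.2, random_state=42
)

tfidf = TfidfVectorizer(max_features=50000, sublinear_tf=True)
Xtr = tfidf.fit_transform(X_train)
Xva = tfidf.transform(X_val)

# One independent binary classifier per label: the multilabel task
# decomposes into six binary problems.
val_probs = {}
for label in LABELS:
    clf = LogisticRegression(C=1.0, max_iter=1000)
    clf.fit(Xtr, y_train[label])
    val_probs[label] = clf.predict_proba(Xva)[:, 1]
```

Naive Bayes, Random Forest or XGBoost can be swapped into the same loop in place of LogisticRegression; the LSTM instead replaces the TF-IDF features with learned embeddings over token sequences.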
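For step 4, a sketch of the evaluation: per-label ROC curves and AUC on predicted probabilities, plus a Hamming score on thresholded predictions. "Hamming score" is taken here as 1 minus the Hamming loss, i.e. the fraction of label slots predicted correctly; the project may use a different variant, such as a sample-wise intersection-over-union accuracy:

```python
import numpy as np
from sklearn.metrics import hamming_loss, roc_auc_score, roc_curve

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Stand-in arrays; in practice use the validation labels and the
# predicted probabilities from the models above.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(1000, len(LABELS)))
y_prob = np.clip(y_true * 0.6 + rng.random((1000, len(LABELS))) * 0.5, 0, 1)

for i, label in enumerate(LABELS):
    fpr, tpr, _ = roc_curve(y_true[:, i], y_prob[:, i])  # points for a ROC plot
    print(f"{label}: AUC = {roc_auc_score(y_true[:, i], y_prob[:, i]):.3f}")

y_pred = (y_prob >= 0.5).astype(int)  # threshold probabilities at 0.5
print("Hamming score:", 1 - hamming_loss(y_true, y_pred))
```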