Toxic Comment Classification Challenge

This repo contains the code I wrote for the Kaggle NLP challenge, Toxic Comment Classification.

About the challenge (from Kaggle)

In this competition, you’re challenged to build a multi-headed model that’s capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate better than Perspective’s current models. You’ll be using a dataset of comments from Wikipedia’s talk page edits. Improvements to the current model will hopefully help online discussion become more productive and respectful.

Tools required

Python 3.5+
NLTK
NumPy
Pandas
scikit-learn
Keras
TensorFlow
GloVe word embeddings
fastText word embeddings

How to run

Select the model to train by setting it in fit_predict.py. The script trains the model with k-fold cross-validation and saves the trained model and predictions.
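For readers unfamiliar with k-fold cross-validation, here is a minimal, dependency-free sketch of the idea. It is not the code from fit_predict.py: the helper names `kfold_indices` and `average_fold_predictions` are hypothetical, and averaging the per-fold test predictions is an assumption about how fold outputs could be combined, not something this README confirms.

```python
import random


def kfold_indices(n_samples, fold_count, seed=0):
    """Split range(n_samples) into fold_count (train, validation) index pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one.
    base, extra = divmod(n_samples, fold_count)
    folds, start = [], 0
    for fold in range(fold_count):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, val_idx))
        start += size
    return folds


def average_fold_predictions(fold_predictions):
    """Average test predictions element-wise across folds.

    fold_predictions[f][i][c] is the score from fold f for comment i, label c.
    (Assumption: fold outputs are combined by a plain mean.)
    """
    fold_count = len(fold_predictions)
    return [
        [sum(values) / fold_count for values in zip(*rows)]
        for rows in zip(*fold_predictions)
    ]
```

Each fold trains on `train_idx`, validates on `val_idx`, and predicts on the full test set. With scikit-learn installed, `sklearn.model_selection.KFold` provides the same kind of split.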

python fit_predict.py train_data_path test_data_path pretrained_embedding_path --result-path --sentences-length --fold-count --dense-size --modelname-prefix --batch-size --dropout-rate

Sample execution command

python fit_predict.py ./data/train.csv ./data/test.csv ./NLP/fasttext_embedding/crawl-300d-2M.vec --result-path ./toxic_results --sentences-length 400 --fold-count 10 --dense-size 256 --modelname-prefix dpcnn400_fasttextcrawl --batch-size 512 --dropout-rate 0.4
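The command line above could be parsed along these lines. This is only a sketch using `argparse` from the standard library: the argument and flag names come from the usage line, but the types are inferred from the sample command, and the real fit_predict.py may define defaults or help text that are not shown here (none are assumed, so omitted flags parse as `None`).

```python
import argparse
import shlex


def build_parser():
    """Build a parser mirroring the flags listed in this README (types are inferred)."""
    parser = argparse.ArgumentParser(prog="fit_predict.py")
    # Positional arguments, in the order shown in the usage line.
    parser.add_argument("train_data_path")
    parser.add_argument("test_data_path")
    parser.add_argument("pretrained_embedding_path")
    # Optional flags; no defaults are assumed here.
    parser.add_argument("--result-path")
    parser.add_argument("--sentences-length", type=int)
    parser.add_argument("--fold-count", type=int)
    parser.add_argument("--dense-size", type=int)
    parser.add_argument("--modelname-prefix")
    parser.add_argument("--batch-size", type=int)
    parser.add_argument("--dropout-rate", type=float)
    return parser


# Parse the sample execution command (without the leading "python fit_predict.py").
sample = (
    "./data/train.csv ./data/test.csv ./NLP/fasttext_embedding/crawl-300d-2M.vec "
    "--result-path ./toxic_results --sentences-length 400 --fold-count 10 "
    "--dense-size 256 --modelname-prefix dpcnn400_fasttextcrawl "
    "--batch-size 512 --dropout-rate 0.4"
)
args = build_parser().parse_args(shlex.split(sample))
print(args.fold_count, args.sentences_length, args.dropout_rate)
```

`argparse` converts dashes in flag names to underscores, so `--fold-count` is read back as `args.fold_count`.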

