This repository documents Kishan Manani and Dat Nguyen's submission to the Toxic Comment Classification Challenge hosted on Kaggle. We explored a variety of methods and tools, including bi-directional LSTMs with word embeddings in Keras, gradient boosted trees with LightGBM and XGBoost, logistic regression, and LASSO, together with standard text processing methods such as TF-IDF. We also used model stacking, also known as blending or ensembling.
Here are some of the modelling ideas we explored during the competition:
- Exploratory data analysis
- Baseline model using logistic regression with TF-IDF features (sketched below)
- Gradient boosting
  - XGBoost on features selected with LASSO
  - The multi-label nature of the target is handled with classifier chaining, which allows the model to learn correlations between labels (both steps are sketched below)
- Bi-directional LSTMs with word embeddings (sketched below)
  - Google News word embeddings (pre-trained word2vec)
  - GloVe
  - This was our best-performing individual model
- Model ensembling (sketched below)
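
The logistic regression baseline pairs TF-IDF features with one binary classifier per label (binary relevance). Below is a minimal sketch, assuming the competition's `train.csv` with a `comment_text` column and one 0/1 column per label; the vectoriser settings and regularisation strength are illustrative, not the exact values used in our submission.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

train = pd.read_csv("train.csv")

# Word-level TF-IDF features over unigrams and bigrams.
vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents="unicode",
    ngram_range=(1, 2),
    max_features=50_000,
)
X = vectorizer.fit_transform(train["comment_text"])

# One independent logistic regression per label, scored with the
# competition metric (ROC AUC).
for label in LABELS:
    clf = LogisticRegression(C=1.0, solver="liblinear")
    scores = cross_val_score(clf, X, train[label], cv=3, scoring="roc_auc")
    print(f"{label}: mean CV AUC = {scores.mean():.4f}")
```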
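
For the gradient boosted models, one way to combine LASSO-style feature selection with classifier chaining is to put an L1-penalised selection step and an XGBoost classifier in a single pipeline, then fit one copy of that pipeline per label in a chain so that later labels can condition on predictions for earlier ones. A minimal sketch, reusing `X`, `train`, and `LABELS` from the baseline above; the selector, chain order, and XGBoost hyperparameters are illustrative assumptions rather than our competition settings.

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

Y = train[LABELS].values

# An L1-penalised (LASSO-style) model keeps only features with non-zero
# coefficients, pruning the sparse TF-IDF matrix before boosting.
selector = SelectFromModel(
    LogisticRegression(penalty="l1", C=0.5, solver="liblinear")
)

base_model = make_pipeline(
    selector,
    XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1, n_jobs=-1),
)

# The chain fits one model per label and appends earlier labels'
# predictions to the features of later models, so correlations
# between the labels can be learned.
chain = ClassifierChain(base_model, order="random", random_state=42)
chain.fit(X, Y)
probas = chain.predict_proba(X)  # shape: (n_samples, n_labels)
```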
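
The bi-directional LSTM reads each comment as a padded sequence of word indices and looks those indices up in a frozen matrix of pre-trained vectors (GloVe or the Google News word2vec). A minimal sketch against the Keras 2-style API (`tensorflow.keras` in TF 2.x), reusing `train` and `LABELS` from the baseline above and assuming the pre-trained vectors have already been loaded into `embedding_matrix`, a NumPy array of shape `(MAX_WORDS, EMBED_DIM)`; layer sizes and training settings are illustrative only.

```python
from tensorflow.keras.layers import (
    LSTM, Bidirectional, Dense, Dropout, Embedding, GlobalMaxPooling1D, Input,
)
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

MAX_WORDS, MAX_LEN, EMBED_DIM = 50_000, 150, 300

# Convert comments to padded integer sequences.
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(train["comment_text"])
seqs = pad_sequences(tokenizer.texts_to_sequences(train["comment_text"]), maxlen=MAX_LEN)

inp = Input(shape=(MAX_LEN,))
embedding = Embedding(MAX_WORDS, EMBED_DIM, trainable=False)  # frozen pre-trained vectors
x = embedding(inp)
x = Bidirectional(LSTM(64, return_sequences=True))(x)
x = GlobalMaxPooling1D()(x)
x = Dropout(0.2)(x)
out = Dense(len(LABELS), activation="sigmoid")(x)  # one sigmoid per label

model = Model(inp, out)
embedding.set_weights([embedding_matrix])  # load the GloVe / word2vec vectors
model.compile(loss="binary_crossentropy", optimizer="adam")
model.fit(seqs, train[LABELS].values, batch_size=256, epochs=2, validation_split=0.1)
```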
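
Finally, predictions from the different models are blended. The simplest form is an average of each model's predicted probabilities, as in the sketch below; the file names are hypothetical and the weights we actually used are not reproduced here. Each CSV is assumed to be in Kaggle submission format, with an `id` column plus one probability column per label, reusing `LABELS` from the baseline sketch.

```python
import pandas as pd

# Hypothetical per-model prediction files in Kaggle submission format.
paths = ["lstm_preds.csv", "lgbm_preds.csv", "logreg_preds.csv"]
submissions = [pd.read_csv(path) for path in paths]

# Simple unweighted average of the predicted probabilities.
blend = submissions[0].copy()
for label in LABELS:
    blend[label] = sum(sub[label] for sub in submissions) / len(submissions)

blend.to_csv("blended_submission.csv", index=False)
```

Stacking takes this one step further: out-of-fold predictions from each base model become the input features of a second-level model (for example, a logistic regression per label) that is trained to produce the final probabilities.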