This repository documents Kishan Manani and Dat Nguyen's submission to the Toxic Comment Classification Challenge hosted on Kaggle. We explored a variety of methods and tools, including bi-directional LSTMs with word embeddings in Keras, gradient boosted trees with LightGBM and XGBoost, logistic regression, and LASSO, together with standard text processing methods such as TF-IDF. We also used model stacking, also known as blending or ensembling.
Here are some of the modelling ideas we explored during the competition:
- Exploratory data analysis
- Baseline model using logistic regression with TF-IDF features (sketched below)
- Gradient boosting
  - XGBoost on features selected with LASSO
  - The multi-label nature of the target is handled with classifier chaining, which allows the model to learn correlations between labels (both steps are sketched below)
- Bi-directional LSTMs with word embeddings (sketched below)
  - Google News word embeddings (pre-trained word2vec)
  - GloVe
  - This was our best-performing individual model
- Model ensembling (sketched below)
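
The logistic regression baseline pairs TF-IDF features with one binary classifier per label (binary relevance). Below is a minimal sketch, assuming the competition's `train.csv` with a `comment_text` column and one 0/1 column per label; the vectoriser settings and regularisation strength are illustrative, not the exact values used in our submission.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

train = pd.read_csv("train.csv")

# Word-level TF-IDF features over unigrams and bigrams.
vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents="unicode",
    ngram_range=(1, 2),
    max_features=50_000,
)
X = vectorizer.fit_transform(train["comment_text"])

# One independent logistic regression per label, scored with the
# competition metric (ROC AUC).
for label in LABELS:
    clf = LogisticRegression(C=1.0, solver="liblinear")
    scores = cross_val_score(clf, X, train[label], cv=3, scoring="roc_auc")
    print(f"{label}: mean CV AUC = {scores.mean():.4f}")
```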
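
For the gradient boosted models, one way to combine LASSO-style feature selection with classifier chaining is to put an L1-penalised selection step and an XGBoost classifier in a single pipeline, then fit one copy of that pipeline per label in a chain so that later labels can condition on predictions for earlier ones. A minimal sketch, reusing `X`, `train`, and `LABELS` from the baseline above; the selector, chain order, and XGBoost hyperparameters are illustrative assumptions rather than our competition settings.

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

Y = train[LABELS].values

# An L1-penalised (LASSO-style) model keeps only features with non-zero
# coefficients, pruning the sparse TF-IDF matrix before boosting.
selector = SelectFromModel(
    LogisticRegression(penalty="l1", C=0.5, solver="liblinear")
)

base_model = make_pipeline(
    selector,
    XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1, n_jobs=-1),
)

# The chain fits one model per label and appends earlier labels'
# predictions to the features of later models, so correlations
# between the labels can be learned.
chain = ClassifierChain(base_model, order="random", random_state=42)
chain.fit(X, Y)
probas = chain.predict_proba(X)  # shape: (n_samples, n_labels)
```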
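
The bi-directional LSTM reads each comment as a padded sequence of word indices and looks those indices up in a frozen matrix of pre-trained vectors (GloVe or the Google News word2vec). A minimal sketch against the Keras 2-style API (`tensorflow.keras` in TF 2.x), reusing `train` and `LABELS` from the baseline above and assuming the pre-trained vectors have already been loaded into `embedding_matrix`, a NumPy array of shape `(MAX_WORDS, EMBED_DIM)`; layer sizes and training settings are illustrative only.

```python
from tensorflow.keras.layers import (
    LSTM, Bidirectional, Dense, Dropout, Embedding, GlobalMaxPooling1D, Input,
)
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

MAX_WORDS, MAX_LEN, EMBED_DIM = 50_000, 150, 300

# Convert comments to padded integer sequences.
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(train["comment_text"])
seqs = pad_sequences(tokenizer.texts_to_sequences(train["comment_text"]), maxlen=MAX_LEN)

inp = Input(shape=(MAX_LEN,))
embedding = Embedding(MAX_WORDS, EMBED_DIM, trainable=False)  # frozen pre-trained vectors
x = embedding(inp)
x = Bidirectional(LSTM(64, return_sequences=True))(x)
x = GlobalMaxPooling1D()(x)
x = Dropout(0.2)(x)
out = Dense(len(LABELS), activation="sigmoid")(x)  # one sigmoid per label

model = Model(inp, out)
embedding.set_weights([embedding_matrix])  # load the GloVe / word2vec vectors
model.compile(loss="binary_crossentropy", optimizer="adam")
model.fit(seqs, train[LABELS].values, batch_size=256, epochs=2, validation_split=0.1)
```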
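
Finally, predictions from the different models are blended. The simplest form is an average of each model's predicted probabilities, as in the sketch below; the file names are hypothetical and the weights we actually used are not reproduced here. Each CSV is assumed to be in Kaggle submission format, with an `id` column plus one probability column per label, reusing `LABELS` from the baseline sketch.

```python
import pandas as pd

# Hypothetical per-model prediction files in Kaggle submission format.
paths = ["lstm_preds.csv", "lgbm_preds.csv", "logreg_preds.csv"]
submissions = [pd.read_csv(path) for path in paths]

# Simple unweighted average of the predicted probabilities.
blend = submissions[0].copy()
for label in LABELS:
    blend[label] = sum(sub[label] for sub in submissions) / len(submissions)

blend.to_csv("blended_submission.csv", index=False)
```

Stacking takes this one step further: out-of-fold predictions from each base model become the input features of a second-level model (for example, a logistic regression per label) that is trained to produce the final probabilities.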