Toxic Comments

tl;dr: Surfacing toxic Wikipedia comments, by training an NLP deep learning model utilizing multi-task learning a variety of deep learning architectures. Data from a Kaggle competition

Quick start

# Install anaconda environment
conda env create -f environment.yml 

# Activate environment
source activate toxic

# Run script
cd bin/
python main.py

Repo structure

bin/main.py: Code entry point
docs/writeup/writeup.md: Project summary
conf/confs.yaml: Configuration file, used to choose parameters
docs/modeling_notes.md: Notes / support for design decisions
data/schemas: Data set schemas

Python Environment

Python code in this repo utilizes packages that are not part of the common library. To make sure you have all of the appropriate packages, please install Anaconda, and install the environment described in environment.yml (Instructions here).

Configuration file

This program utilizes a configuration file (conf/confs.yaml). It will run with the default parameters, but many parameters can be freely changed. Parameters include:

run_train: Whether to train a model
run_infer: Whether to used the trained model to predict classifications for the Kaggle test data set
test_run: Whether to run on a subset of observations. This is helpful for debugging
model_choice: Either the name of a method in bin/models, or serialized. If serialized, use serialized_model_choice_path to provide a path to a serialized model
num_epochs: The number of epochs to train for

Contact