# Toxic Comments
tl;dr: Surfacing toxic Wikipedia comments by training NLP deep learning models, using multi-task learning and a variety of deep learning architectures. Data from a Kaggle competition.
## Quick start

```bash
# Install anaconda environment
conda env create -f environment.yml

# Activate environment
source activate toxic

# Run script
cd bin/
python main.py
```
## Repo structure

 - `bin/main.py`: Code entry point
 - `docs/writeup/writeup.md`: Project summary
 - `conf/confs.yaml`: Configuration file, used to choose parameters
 - `docs/modeling_notes.md`: Notes / support for design decisions
 - `data/schemas`: Data set schemas
## Python Environment

Python code in this repo utilizes packages that are not part of the Python standard library. To make sure you have all of the appropriate packages, please install Anaconda and create the environment described in `environment.yml` (instructions here).
## Configuration file

This program utilizes a configuration file (`conf/confs.yaml`). It will run with the default parameters, but many parameters can be freely changed. Parameters include:

 - `run_train`: Whether to train a model
 - `run_infer`: Whether to use the trained model to predict classifications for the Kaggle test data set
 - `test_run`: Whether to run on a subset of observations. This is helpful for debugging
 - `model_choice`: Either the name of a method in `bin/models`, or `serialized`. If `serialized`, use `serialized_model_choice_path` to provide a path to a serialized model
 - `num_epochs`: The number of epochs to train for
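Putting the parameters above together, a `conf/confs.yaml` might look like the following sketch. The values shown are illustrative assumptions, not the repo's actual defaults:

```yaml
# Illustrative sketch of conf/confs.yaml; values are assumptions, not the repo's defaults
run_train: True
run_infer: True
test_run: False                       # set to True to run on a subset of observations while debugging
model_choice: baseline_model          # hypothetical method name in bin/models, or "serialized"
serialized_model_choice_path: null    # path to a serialized model; only used when model_choice is "serialized"
num_epochs: 10
```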
## Contact

Feel free to contact me at 13herger `<at>` gmail `<dot>` com