Toxic Comment Classification

Yet another toxic comment classifier.

Installation

Prerequisites

- Python 3.7 or higher
- GNU Make
- CUDA 10.2 or higher

Cloning

Clone the repo to your local machine:

git clone https://github.com/halecakir/toxic-comment-classification

Installing

Build the Python virtual environment:

make venv/bin/activate

Fetching Data

Fetch pretrained word vectors from multiple sources (GloVe, Google News, fastText):

make fetch_all

Training

Train the model on the Jigsaw data:

make train ARGS=WORD_VECTOR  # WORD_VECTOR ∈ {"google.bin", "fasttext.bin", "glove.txt"}
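
For orientation, GloVe text files store one token per line followed by its space-separated float components. The sketch below parses that layout with the standard library only. It is not taken from this repository: the function name read_text_vectors and the sample values are hypothetical, and how the training code actually loads glove.txt is an assumption. The .bin files are binary and are not covered here.

```python
import io


def read_text_vectors(handle):
    """Parse GloVe-style text: a token, then its float components, per line."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny in-memory stand-in for a real vector file (values are made up).
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
vectors = read_text_vectors(sample)
```

With a real file, the same function could be called on open("glove.txt", encoding="utf-8"), assuming that is where the fetched file ends up.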

Testing

Test the model:

make test

Cleaning

Remove all model artifacts:

make clean

Todos

  • Try an attention mechanism
  • Try transformer-based models
  • Try hybrid (word-level + character-level) word vectors for words that have no pretrained vectors
  • Try gradient clipping to counter exploding gradients
  • Add hyperparameter optimization
  • Add sanity tests
  • Documentation!
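
As a note on the gradient clipping item above, a minimal sketch of clipping by global norm in plain Python follows. It is illustrative only: the function name clip_by_global_norm and the numbers are hypothetical, and nothing here is part of the current code base. If the model is built on PyTorch (an assumption, the framework is not stated here), torch.nn.utils.clip_grad_norm_ applies the same idea to model parameters.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients so their combined L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm or total == 0.0:
        return grads, total  # already within bounds, nothing to do
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads], total


# Made-up gradients for two parameter groups; their global norm is 13.
grads = [[3.0, 4.0], [0.0, 12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
```

Scaling every gradient by the same factor keeps the update direction unchanged and only limits its size, which is why this form is commonly preferred over clipping each value independently.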