troll2vec

Convolutional neural network for toxicity classification. Works with the Russian language.


Toxic comment detection

Based on Alexander Rashkin's Keras implementation of Yoon Kim's CNN for natural text classification.

Prepared for Social Weekend Hackathon 13.

Related to the extension

Deployment and usage

First, download the dataset, preprocess it, and train a model.

# Outdated; use toxicity.pos and toxicity.neg from the repo instead

$ kaggle competitions download jigsaw-toxic-comment-classification-challenge
$ unzip train.csv.zip
$ jupyter nbconvert --execute ./dataset.ipynb
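
A minimal sketch of how the toxicity.pos / toxicity.neg files could be loaded for training. The one-comment-per-line format and the load_corpus helper are assumptions for illustration, not the repo's actual code:

# Hypothetical loader for toxicity.pos / toxicity.neg.
# Assumes UTF-8 text, one comment per line; .pos is taken to hold toxic examples.
def load_corpus(pos_path="toxicity.pos", neg_path="toxicity.neg"):
    texts, labels = [], []
    for path, label in ((pos_path, 1), (neg_path, 0)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    texts.append(line)
                    labels.append(label)
    return texts, labels

texts, labels = load_corpus()
print(len(texts), "comments loaded")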

Then execute:

$ python3 ./train.py

After these steps you will be able to run the Flask server:

$ python3 ./server.py

An example request and server response is shown below:

$ curl -XPOST -d '{"id1": "Дальше вы не пройдете, пока не получите бумаги", "id2": "Ваша мать дает"}' localhost:5000/api
{
  "id1": 0.03455933555960655,
  "id2": 0.4351857006549835
}
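
The same call can be made from Python; a small sketch using the requests library (the /api endpoint, port, and payload shape are taken from the curl example above):

import json
import requests

# Map of arbitrary ids to comments; the server returns a toxicity score per id.
payload = {
    "id1": "Дальше вы не пройдете, пока не получите бумаги",
    "id2": "Ваша мать дает",
}

resp = requests.post("http://localhost:5000/api", data=json.dumps(payload))
for comment_id, score in resp.json().items():
    print(comment_id, round(score, 3))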

Convolutional neural network

Trains a convolutional network for toxicity detection. Based on "Convolutional Neural Networks for Sentence Classification" by Yoon Kim. Inspired by Denny Britz's article "Implementing a CNN for Text Classification in TensorFlow". Accuracy reaches 88-90% for "CNN-rand" and "CNN-non-static", and about 85% for "CNN-static".

Some differences from the original article (a Keras sketch follows this list):

  • larger corpus and longer sentences; sentence length matters as much as dataset size
  • smaller embedding dimension, 50 instead of 300
  • 2 filter sizes instead of the original 3
  • far fewer filters; experiments show that 3-10 are enough, while the original work uses 100
  • random initialization is no worse than word2vec initialization on the IMDB corpus
  • sliding max pooling instead of the original global pooling
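
A rough Keras sketch of such a model, using the functional API. The sequence length, vocabulary size, filter sizes, and pooling window below are illustrative guesses, not the exact values from train.py:

from keras.models import Model
from keras.layers import (Input, Embedding, Conv1D, MaxPooling1D,
                          Flatten, Concatenate, Dropout, Dense)

SEQ_LEN = 100          # max sentence length in tokens (assumption)
VOCAB_SIZE = 20000     # vocabulary size (assumption)
EMB_DIM = 50           # small embedding, per the notes above
FILTER_SIZES = (3, 5)  # 2 filter sizes instead of the original 3
NUM_FILTERS = 8        # "3-10 is enough"

inp = Input(shape=(SEQ_LEN,), dtype="int32")
# Randomly initialized embeddings ("CNN-rand" style)
emb = Embedding(VOCAB_SIZE, EMB_DIM, input_length=SEQ_LEN)(inp)

branches = []
for size in FILTER_SIZES:
    conv = Conv1D(NUM_FILTERS, size, activation="relu")(emb)
    # Sliding max pooling over windows instead of global pooling over the whole sentence
    pooled = MaxPooling1D(pool_size=5)(conv)
    branches.append(Flatten()(pooled))

merged = Dropout(0.5)(Concatenate()(branches))
out = Dense(1, activation="sigmoid")(merged)

model = Model(inputs=inp, outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()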

Dependencies

  • The Keras deep learning library and the most recent Theano backend must be installed; you can use pip for that. Not tested with the TensorFlow backend, but it should work.
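
A typical install, assuming Python 3 and pip are already available (package versions are not pinned here):

$ pip3 install keras theano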