troll2vec

Convolutional neural network for toxicity classification. Works with the Russian language.


Toxic comment detection

Based on Alexander Rashkin's Keras implementation of Yoon Kim's CNN for natural text classification.

Prepared for Social Weekend Hackathon 13.

Related to the extension

Deployment and usage

First, download the dataset, preprocess it, and train a model.

# Outdated; use toxicity.pos and toxicity.neg from the repo instead

$ kaggle competitions download jigsaw-toxic-comment-classification-challenge
$ unzip train.csv.zip
$ jupyter nbconvert --execute ./dataset.ipynb
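
A minimal sketch of how the toxicity.pos / toxicity.neg files could be loaded for training. The one-comment-per-line format and the load_corpus helper are assumptions for illustration, not the repo's actual code:

# Hypothetical loader for toxicity.pos / toxicity.neg.
# Assumes UTF-8 text, one comment per line; .pos is taken to hold toxic examples.
def load_corpus(pos_path="toxicity.pos", neg_path="toxicity.neg"):
    texts, labels = [], []
    for path, label in ((pos_path, 1), (neg_path, 0)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    texts.append(line)
                    labels.append(label)
    return texts, labels

texts, labels = load_corpus()
print(len(texts), "comments loaded")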

Then execute:

$ python3 ./train.py

After these steps you will be able to run the Flask server:

$ python3 ./server.py

An example request and server response is shown below:

$ curl -XPOST -d '{"id1": "Дальше вы не пройдете, пока не получите бумаги", "id2": "Ваша мать дает"}' localhost:5000/api
{
  "id1": 0.03455933555960655,
  "id2": 0.4351857006549835
}
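
The same call can be made from Python; a small sketch using the requests library (the /api endpoint, port, and payload shape are taken from the curl example above):

import json
import requests

# Map of arbitrary ids to comments; the server returns a toxicity score per id.
payload = {
    "id1": "Дальше вы не пройдете, пока не получите бумаги",
    "id2": "Ваша мать дает",
}

resp = requests.post("http://localhost:5000/api", data=json.dumps(payload))
for comment_id, score in resp.json().items():
    print(comment_id, round(score, 3))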

Convolutional neural network

Trains a convolutional network for toxicity detection. Based on "Convolutional Neural Networks for Sentence Classification" by Yoon Kim. Inspired by Denny Britz's article "Implementing a CNN for Text Classification in TensorFlow". Accuracy reaches 88-90% for "CNN-rand" and "CNN-non-static", and about 85% for "CNN-static".

Some differences from the original article (a Keras sketch follows this list):

  • larger corpus and longer sentences; sentence length matters as much as dataset size
  • smaller embedding dimension, 50 instead of 300
  • 2 filter sizes instead of the original 3
  • far fewer filters; experiments show that 3-10 are enough, while the original work uses 100
  • random initialization is no worse than word2vec initialization on the IMDB corpus
  • sliding max pooling instead of the original global pooling
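
A rough Keras sketch of such a model, using the functional API. The sequence length, vocabulary size, filter sizes, and pooling window below are illustrative guesses, not the exact values from train.py:

from keras.models import Model
from keras.layers import (Input, Embedding, Conv1D, MaxPooling1D,
                          Flatten, Concatenate, Dropout, Dense)

SEQ_LEN = 100          # max sentence length in tokens (assumption)
VOCAB_SIZE = 20000     # vocabulary size (assumption)
EMB_DIM = 50           # small embedding, per the notes above
FILTER_SIZES = (3, 5)  # 2 filter sizes instead of the original 3
NUM_FILTERS = 8        # "3-10 is enough"

inp = Input(shape=(SEQ_LEN,), dtype="int32")
# Randomly initialized embeddings ("CNN-rand" style)
emb = Embedding(VOCAB_SIZE, EMB_DIM, input_length=SEQ_LEN)(inp)

branches = []
for size in FILTER_SIZES:
    conv = Conv1D(NUM_FILTERS, size, activation="relu")(emb)
    # Sliding max pooling over windows instead of global pooling over the whole sentence
    pooled = MaxPooling1D(pool_size=5)(conv)
    branches.append(Flatten()(pooled))

merged = Dropout(0.5)(Concatenate()(branches))
out = Dense(1, activation="sigmoid")(merged)

model = Model(inputs=inp, outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()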

Dependencies

  • The Keras deep learning library and the most recent Theano backend must be installed; you can use pip for that. Not tested with the TensorFlow backend, but it should work.
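
A typical install, assuming Python 3 and pip are already available (package versions are not pinned here):

$ pip3 install keras theano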