Repo for the Kaggle Jigsaw Unintended Bias in Toxicity Classification competition
https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data
Reddit Corpus
https://www.reddit.com/r/datasets/comments/65o7py/updated_reddit_comment_dataset_as_torrents/
YouTube Corpus
https://www.kaggle.com/datasnaek/youtube
Swearword Corpus
http://www.bannedwordlist.com/
Emoji Sentiment Ranking
http://kt.ijs.si/data/Emoji_sentiment_ranking/index.html
Urban Dictionary Corpus
https://www.kaggle.com/therohk/urban-dictionary-words-dataset
Academic Research
Hate Speech Classifier
https://aaai.org/ocs/index.php/ICWSM/ICWSM17/paper/view/15665/14843 (Paper)
https://github.com/t-davidson/hate-speech-and-offensive-language (GitHub)
API
Perspective API
https://www.perspectiveapi.com/#/
Metric: the competition's custom bias metric (overall AUC combined with generalized means of the per-identity bias AUCs)
Model | Embedding | Comment | Local CV score | Kaggle leaderboard score |
---|---|---|---|---|
Single LSTM | Custom word2vec | Default stopwords | 0.9191 | |
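
For reference, the competition's bias metric used for the scores above can be sketched as follows. This is a minimal pure-Python illustration, not the code used in this repo: it assumes labels and identity flags are already binarized (the competition thresholds both at 0.5), and the function names (`auc`, `power_mean`, `bias_aucs`, `final_score`) are hypothetical. The pairwise AUC is quadratic in the number of rows, so for real data something like `sklearn.metrics.roc_auc_score` would be the usual choice.

```python
POWER = -5     # exponent of the generalized mean used by the competition
WEIGHT = 0.25  # equal weight for the overall AUC and each of the three bias terms


def auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        return float("nan")  # undefined when one class is missing
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def power_mean(values, p=POWER):
    """Generalized mean; a negative p pulls the result toward the worst subgroup."""
    return (sum(v ** p for v in values) / len(values)) ** (1 / p)


def bias_aucs(labels, scores, in_subgroup):
    """Return (subgroup AUC, BPSN AUC, BNSP AUC) for one identity subgroup."""
    rows = list(zip(labels, scores, in_subgroup))

    def pick(keep):
        chosen = [(y, s) for y, s, g in rows if keep(y, g)]
        return auc([y for y, _ in chosen], [s for _, s in chosen])

    subgroup = pick(lambda y, g: g)
    # BPSN: background positives together with subgroup negatives
    bpsn = pick(lambda y, g: (g and not y) or (not g and y))
    # BNSP: background negatives together with subgroup positives
    bnsp = pick(lambda y, g: (g and y) or (not g and not y))
    return subgroup, bpsn, bnsp


def final_score(labels, scores, subgroups):
    """subgroups maps an identity name to a per-row membership mask."""
    per_group = [bias_aucs(labels, scores, mask) for mask in subgroups.values()]
    bias_terms = [power_mean([g[i] for g in per_group]) for i in range(3)]
    return WEIGHT * auc(labels, scores) + WEIGHT * sum(bias_terms)
```

With toy inputs, a model that ranks every toxic comment above every non-toxic one scores 1.0 overall and within each subgroup, so `final_score` returns 1.0. A subgroup with no positive or no negative rows yields `nan`, which a real evaluation would need to handle.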