samtaylor54321/toxicity_in_classification

Kaggle toxicity in classification competition

Kaggle Jigsaw Unintended Bias in Toxicity Classification challenge

Repo for Kaggle Jigsaw Competition

https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data

External Datasets/Sources - Must be declared if used

Reddit Corpus

https://www.reddit.com/r/datasets/comments/65o7py/updated_reddit_comment_dataset_as_torrents/

YouTube Corpus

https://www.kaggle.com/datasnaek/youtube

Swearword Corpus

http://www.bannedwordlist.com/

Emoji Sentiment Ranking

http://kt.ijs.si/data/Emoji_sentiment_ranking/index.html

Urban Dictionary Corpus

https://www.kaggle.com/therohk/urban-dictionary-words-dataset

Academic Research

Hate Speech Classifier

https://aaai.org/ocs/index.php/ICWSM/ICWSM17/paper/view/15665/14843 (Paper)

https://github.com/t-davidson/hate-speech-and-offensive-language (GitHub)

API

Perspective API

https://www.perspectiveapi.com/#/
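As a rough illustration of how a comment could be scored with the Perspective API, the sketch below builds the JSON request and pulls the summary scores out of a reply. It assumes the v1alpha1 `comments:analyze` endpoint and a valid API key. The helper names (`build_request_body`, `extract_scores`, `score_comment`) are hypothetical and are not taken from this repo.

```python
import json
import urllib.request

# Assumed endpoint for the Perspective API (v1alpha1).
ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def build_request_body(text, attributes=("TOXICITY",)):
    """Build the JSON payload asking for one score per requested attribute."""
    return {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {name: {} for name in attributes},
    }


def extract_scores(response):
    """Map each attribute name to its summary score (a probability-like value)."""
    return {
        name: data["summaryScore"]["value"]
        for name, data in response["attributeScores"].items()
    }


def score_comment(text, api_key):
    """Send one comment to the API; needs network access and a valid key."""
    body = json.dumps(build_request_body(text)).encode("utf-8")
    request = urllib.request.Request(
        f"{ANALYZE_URL}?key={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return extract_scores(json.load(reply))
```

Calling `score_comment("some comment text", api_key)` would return a dictionary keyed by attribute name, for example `TOXICITY`. Any scores obtained this way count as an external source and would need to be declared like the datasets listed above.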

Scores

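Metric = competition's custom bias metric: the overall ROC AUC combined with the generalized power means (p = -5) of three per-identity bias AUCs (subgroup, BPSN, BNSP), each of the four terms weighted 0.25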

| Model | Embedding | Comment | Local CV score | Kaggle leaderboard score |
| --- | --- | --- | --- | --- |
| Single LSTM | Custom word2vec | Default stopwords | 0.9191 | |
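The competition's bias metric behind these scores can be sketched in plain Python as follows. This is a minimal, dependency-free illustration, not the code used in this repo. It assumes labels are already binarized (the competition treats target >= 0.5 as toxic) and that each identity subgroup is given as a list of booleans. All function names are hypothetical.

```python
def auc(labels, scores):
    """ROC AUC via pairwise comparison (ties count as 0.5). O(n^2), sketch only."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        return float("nan")  # AUC is undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def power_mean(values, p=-5.0):
    """Generalized mean; a negative p pulls the result toward the worst value.

    Raises ZeroDivisionError if any value is exactly 0.
    """
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


def bias_aucs(labels, scores, in_subgroup):
    """Three bias AUCs for one identity subgroup."""
    def subset_auc(keep):
        idx = [i for i in range(len(labels)) if keep(i)]
        return auc([labels[i] for i in idx], [scores[i] for i in idx])

    return {
        # Only examples that mention the identity.
        "subgroup": subset_auc(lambda i: in_subgroup[i]),
        # Background Positive, Subgroup Negative.
        "bpsn": subset_auc(lambda i: bool(in_subgroup[i]) != bool(labels[i])),
        # Background Negative, Subgroup Positive.
        "bnsp": subset_auc(lambda i: bool(in_subgroup[i]) == bool(labels[i])),
    }


def final_metric(labels, scores, subgroups, p=-5.0, weight=0.25):
    """Overall AUC plus the power means of the three bias AUCs, equally weighted."""
    per_group = [bias_aucs(labels, scores, mask) for mask in subgroups.values()]
    bias_terms = [
        power_mean([m[key] for m in per_group], p)
        for key in ("subgroup", "bpsn", "bnsp")
    ]
    return weight * auc(labels, scores) + weight * sum(bias_terms)
```

Here `subgroups` maps an identity column name to its boolean mask, so a call looks like `final_metric(y_true, y_pred, {"female": mask_f, "muslim": mask_m})`. For real data sizes a vectorized AUC, such as the one in scikit-learn, would be a more practical choice than the pairwise loop.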