Data Integration for Toxic Comment Classification: Making More Than 40 Datasets Easily Accessible in One Unified Format
This repository contains the code for our paper Data Integration for Toxic Comment Classification: Making More Than 40 Datasets Easily Accessible in One Unified Format, accepted for publication at the ACL 2021 Workshop on Online Abuse and Harms (WOAH).
We present a collection of more than 40 datasets in the form of a software tool that automates downloading and processing of the data and presents it in a unified data format, including a mapping of compatible class labels. The tool also provides an overview of the properties of the available datasets, such as language, platform, and class labels, to make it easier to select suitable training and test data for toxic comment classification.
- Clone this repository and install it via `pip install .`:

```
git clone git@github.com:julian-risch/toxic-comment-collection.git
cd toxic-comment-collection
pip install .
```
- Download individual datasets with the `get_dataset()` method into tab-separated files. A list of all datasets can be found at the bottom of this page.

```python
from toxic_comment_collection import get_dataset
get_dataset('basile2019')
```
- It's as simple as that. You can now work with the dataset, for example, with pandas:

```python
import pandas as pd
df = pd.read_csv("./files/basile2019/basile2019en.csv", sep="\t")
df.head()
```

```
   id                                               text   labels
0   0  Hurray, saving us $$$ in so many ways @potus @...  ['hate']
1   1  Why would young fighting age men be the vast m...  ['hate']
2   2  @KamalaHarris Illegals Dump their Kids at the ...  ['hate']
3   3  NY Times: 'Nearly All White' States Pose 'an A...  []
4   4  Orban in Brussels: European leaders are ignori...  []
```
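pandas reads the `labels` column as plain strings, so a little parsing is needed before filtering by class label. A minimal sketch, assuming the column contains stringified Python lists as in the output above:

```python
from ast import literal_eval

# Parse the stringified label lists into actual Python lists and
# keep only the comments annotated as hate speech.
df["labels"] = df["labels"].apply(literal_eval)
hate_df = df[df["labels"].apply(lambda labels: "hate" in labels)]
print(len(hate_df))
```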
- Some datasets require Twitter API credentials to be downloaded. Enter your Twitter API credentials in a file `api_config.json` with the following format. You can store that file in a directory of your choice; for this example, it is stored in `./src/toxic_comment_collection/api_config.json`. If you don't have Twitter API credentials yet, you can apply for them here.

```json
{
    "twitter": {
        "consumer_key": "",
        "consumer_secret": "",
        "access_token": "",
        "access_token_secret": ""
    }
}
```
- After filling out `api_config.json`, the `get_dataset()` method can use it as input:

```python
get_dataset('albadi2018', api_config_path='./src/toxic_comment_collection/api_config.json')
df = pd.read_csv("./files/albadi2018/albadi2018ar_train.csv", sep="\t")
df.head()
```
```
   id                                               text    labels
0   0  مؤسسة أرشيف المغرب تتسلم وثائق عن ذاكرة اليهود...  ['none']
1   1  مفتي السعودية حماس إرهابية وقتال اليهود حرام ش...  ['none']
2   2      أمراء ال سعود اليهود يخوضون حربا عن الصهيونيه  ['hate']
3   3  تحميل كتاب مقارنة الأديان: اليهودية تأليف أحمد...  ['none']
4   4  #هزه_ارضيه_في_جده\n\nهذه هيه الهزه الحقيقيه وت...  ['hate']
```
- All datasets can be downloaded automatically, which will take some time. To respect rate limits of the Twitter API, the program might sleep for several minutes and then continue automatically.

```python
from toxic_comment_collection import get_all_datasets
get_all_datasets(api_config_path='./src/toxic_comment_collection/api_config.json')
```
- After downloading all datasets, they can be combined into one large tab-separated file. To this end, the file `./src/toxic_comment_collection/config.json` defines the mappings of the different labels to a common subset, as described in our paper. You can download it here, as it is part of this GitHub repository. The resulting combined file is stored in `./files/combined.tsv`. Note that the following command skips downloading the datasets, assuming you have already downloaded them:

```python
get_all_datasets(config_path="./src/toxic_comment_collection/config.json", skip_download=True, api_config_path='./src/toxic_comment_collection/api_config.json')
```
- A summary of all downloaded datasets can be generated with the `generate_statistics()` method:

```python
from toxic_comment_collection import generate_statistics
generate_statistics('./files')
```
- It creates a file called `statistics.txt`:

```
######################
# Overall Statistics #
######################

rows: 812094
file size: 241226068
labels:
    indirect: 13479
    none: 471720
    offensive: 66742
    ...
```
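If you prefer to compute such statistics yourself, here is a minimal sketch with pandas, assuming `./files/combined.tsv` exists and uses the same `id`/`text`/`labels` columns with stringified label lists as the per-dataset files:

```python
from ast import literal_eval
from collections import Counter

import pandas as pd

# Count how often each class label occurs in the combined file
# (assumed to use the same columns as the per-dataset files).
df = pd.read_csv("./files/combined.tsv", sep="\t")
label_counts = Counter(
    label for labels in df["labels"].apply(literal_eval) for label in labels
)
print(f"rows: {len(df)}")
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```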
If you use our work, please cite our paper Data Integration for Toxic Comment Classification: Making More Than 40 Datasets Easily Accessible in One Unified Format that has been published at the ACL'21 Workshop on Online Abuse and Harms (WOAH) as follows:
```bibtex
@inproceedings{risch-etal-2021-toxic,
    title = "Data Integration for Toxic Comment Classification: Making More Than 40 Datasets Easily Accessible in One Unified Format",
    author = "Risch, Julian and Schmidt, Philipp and Krestel, Ralf",
    booktitle = "Proceedings of the Workshop on Online Abuse and Harms (WOAH)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.woah-1.17",
    doi = "10.18653/v1/2021.woah-1.17",
    pages = "157--163",
}
```
- Create a new file in the `datasets` folder with the following naming scheme: `<paper_author><paper_year><suffix (for duplicate file names)>.py`
- Complete the content of the file like in the following example:

```python
import os

from . import dataset
from . import helpers


class Mubarak2017aljazeera(dataset.Dataset):
    name = "mubarak2017aljazeera"
    url = "http://alt.qcri.org/~hmubarak/offensive/AJCommentsClassification-CF.xlsx"
    hash = "afa00e36ff5492c1bbdd42a0e4979886f40d00f1aa5517807a957e22fb517670"
    files = [
        {
            "name": "mubarak2017ar_aljazeera.csv",
            "language": "ar",
            "type": "training",
            "platform": "twitter"
        }
    ]
    comment = """Annotation Meaning
0 NORMAL_LANGUAGE
-1 OFFENSIVE_LANGUAGE
-2 OBSCENE_LANGUAGE"""
    license = """ """
```
- Overwrite the methods `process` and `unify_row` from the parent class (`dataset.Dataset`) to unpack and process the downloaded files. You might use methods from `datasets/helpers.py`; a hypothetical sketch follows below.
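Purely as an illustration, continuing the example file above, such an override could look like the following. The actual signatures of `process` and `unify_row` are defined in `dataset.Dataset`, so the method arguments, the assumed helper `helpers.read_excel`, and the annotation column names below are illustrative assumptions, not the repository's actual API:

```python
class Mubarak2017aljazeera(dataset.Dataset):
    # ... class attributes as in the example above ...

    def process(self, file_path):
        # Hypothetical sketch: read the downloaded file row by row and
        # convert each row into the unified format via unify_row().
        rows = helpers.read_excel(file_path)  # assumed helper
        return [self.unify_row(row) for row in rows]

    def unify_row(self, row):
        # Hypothetical sketch: map the dataset-specific annotations
        # (0, -1, -2; see the comment attribute above) to label lists.
        labels = {
            "0": [],
            "-1": ["offensive"],
            "-2": ["obscene"],
        }[str(row["annotation"])]
        return {"text": row["text"], "labels": labels}
```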
- The resulting `.csv` should have the following columns:
  - id (added automatically)
  - text
  - labels
- Add the newly created file and class to `datasets/helpers.py` (both the `import` statements and the `get_datasets()` method).
- Make sure to update `config.json` to include the mapping of the labels of the new dataset to the common subset of labels shared by the other datasets in the collection; a hypothetical example is sketched below.
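The exact schema is defined by the existing entries in `config.json`. As a purely hypothetical sketch (the field names and nesting are assumptions, not the actual schema), an entry mapping the labels of the new dataset to the common subset might look like this:

```json
{
    "mubarak2017aljazeera": {
        "NORMAL_LANGUAGE": "none",
        "OFFENSIVE_LANGUAGE": "offensive",
        "OBSCENE_LANGUAGE": "obscene"
    }
}
```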
- We are happy to adopt your changes. Just create a pull request from your fork to this repository.
# | State | Name | Class |
---|---|---|---|
1 | Done | Are They our Brothers? Analysis and Detection of Religious Hate Speech in the Arabic Twittersphere | Albadi2018 |
2 | Done | Multilingual and Multi-Aspect Hate Speech Analysis (Arabic) | Ousidhoum2019 |
3 | Done | L-HSAB: A Levantine Twitter Dataset for Hate Speech and Abusive Language | mulki2019 |
4 | Done | Abusive Language Detection on Arabic Social Media (Twitter) | Mubarak2017twitter |
5 | Done | Abusive Language Detection on Arabic Social Media (Al Jazeera) | Mubarak2017aljazeera |
6 | Postponed (OneDrive) | Dataset Construction for the Detection of Anti-Social Behaviour in Online Communication in Arabic | |
7 | Postponed (Login Required) | Datasets of Slovene and Croatian Moderated News Comments | |
8 | tar.bz2 file | Offensive Language and Hate Speech Detection for Danish | |
9 | Done | Automated Hate Speech Detection and the Problem of Offensive Language | Davidson2017 |
10 | Done | Hate Speech Dataset from a White Supremacy Forum | Gibert2018 |
11 | Done | Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter | Waseem2016 |
12 | Done | Detecting Online Hate Speech Using Context Aware Models | Gao2018 |
13 | Done | Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter | Waseem2016 |
14 | Done | When Does a Compliment Become Sexist? Analysis and Classification of Ambivalent Sexism Using Twitter Data | Jha2017 |
15 | Password required | Overview of the Task on Automatic Misogyny Identification at IberEval 2018 (English) | |
16 | Done | CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech (English) | Chung2019 |
17 | Not suited | Characterizing and Detecting Hateful Users on Twitter | |
18 | Done | A Benchmark Dataset for Learning to Intervene in Online Hate Speech (Gab) | Qian2019 |
19 | Done | A Benchmark Dataset for Learning to Intervene in Online Hate Speech (Reddit) | Qian2019 |
20 | Done | Multilingual and Multi-Aspect Hate Speech Analysis (English) | Ousidhoum2019 |
21 | Postponed (includes pictures) | Exploring Hate Speech Detection in Multimodal Publications | |
22 | Uses OLID Dataset | Predicting the Type and Target of Offensive Posts in Social Media | |
23 | Done | SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter | Basile2019 |
24 | Done | Peer to Peer Hate: Hate Speech Instigators and Their Targets | ElSherief2018 |
25 | Done | Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages | Mandl2019en |
26 | Done | Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior | Founta2018 |
27 | E-Mail required | A Large Labeled Corpus for Online Harassment Research | |
28 | Done | Ex Machina: Personal Attacks Seen at Scale, Personal attacks | Wulczyn2017attack |
29 | Done | Ex Machina: Personal Attacks Seen at Scale, Toxicity | Wulczyn2017toxic |
30 | Done | Detecting cyberbullying in online communities (World of Warcraft) | |
31 | Done | Detecting cyberbullying in online communities (League of Legends) | |
32 | E-Mail required | A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research | Rezvan2018 |
33 | Done | Ex Machina: Personal Attacks Seen at Scale, Aggression and Friendliness | Wulczyn2017aggressive |
34 | Done | CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech (English) | Chung2019 |
35 | Done | Multilingual and Multi-Aspect Hate Speech Analysis (English) | Ousidhoum2019 |
36 | Done | Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis | Ross2017 |
37 | Done | Detecting Offensive Statements Towards Foreigners in Social Media | Bretschneider2017 |
38 | Done | Overview of the GermEval 2018 Shared Task on the Identification of Offensive Language | Wiegand2018 |
39 | Done | Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages | Mandl2019ger |
40 | Website Down | Deep Learning for User Comment Moderation, Flagged Comments | |
41 | Website Down | Deep Learning for User Comment Moderation, Flagged Comments | |
42 | Done | Offensive Language Identification in Greek | Pitenis2020 |
43 | Application required | Aggression-annotated Corpus of Hindi-English Code-mixed Data | |
44 | Application required | Aggression-annotated Corpus of Hindi-English Code-mixed Data | |
45 | Done | Did you offend me? Classification of Offensive Tweets in Hinglish Language | Mathur2018 |
46 | Dataset not available | A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection | |
47 | Done | Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages | Mandl2019hind |
48 | Done | Hate Speech Detection in the Indonesian Language: A Dataset and Preliminary Study | Alfina2018 |
49 | Done | Multi-Label Hate Speech and Abusive Language Detection in Indonesian Twitter | Ibrohim2019 |
50 | Done | A Dataset and Preliminaries Study for Abusive Language Detection in Indonesian Social Media | Ibrohim2018 |
51 | Done | An Italian Twitter Corpus of Hate Speech against Immigrants | Sanguinetti2018 |
52 | Application required | Overview of the EVALITA 2018 Hate Speech Detection Task (Facebook) | |
53 | Application required | Overview of the EVALITA 2018 Hate Speech Detection Task (Twitter) | |
54 | Done | CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech (English) | Chung2019 |
55 | XML import required | Creating a WhatsApp Dataset to Study Pre-teen Cyberbullying | |
56 | Files not found | Results of the PolEval 2019 Shared Task 6: First Dataset and Open Shared Task for Automatic Cyberbullying Detection in Polish Twitter | |
57 | Done | A Hierarchically-Labeled Portuguese Hate Speech Dataset | Fortuna2019 |
58 | ARFF file format | Offensive Comments in the Brazilian Web: A Dataset and Baseline Results | |
59 | Encrypted | Datasets of Slovene and Croatian Moderated News Comments | |
60 | Application required | Overview of MEX-A3T at IberEval 2018: Authorship and Aggressiveness Analysis in Mexican Spanish Tweets | |
61 | |||
62 | Data not found | hatEval, SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (Spanish) | |
63 | Done | A Corpus of Turkish Offensive Language on Social Media | Coltekin2019 |
64 | Done | Aggression-annotated Corpus of Hindi-English Code-mixed Data | Kumar2018 |
65 | Done | Predicting the Type and Target of Offensive Posts in Social Media | Zampieri2019 |
This is a utility library that downloads and transforms publicly available datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use them. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.
If you wish to add, update or remove a dataset, please get in touch through a GitHub pull request.