This repository contains the datasets mentioned in the following papers.
- HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection (source code)
- Automated Hate Speech Detection and the Problem of Offensive Language
- Task: Contributors viewed short texts and identified whether each a) contained hate speech, b) was offensive but not hate speech, or c) was not offensive at all. Contains nearly 15K rows with three contributor judgments per text string. (3 MB)
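Since each text string carries three contributor judgments, a single label is usually derived by majority vote. A minimal sketch (the row layout and label encoding here are hypothetical, not the dataset's actual schema):

```python
from collections import Counter

# Hypothetical rows: (text, [judgment1, judgment2, judgment3])
# with labels: 0 = hate speech, 1 = offensive, 2 = neither.
rows = [
    ("example tweet a", [0, 0, 1]),
    ("example tweet b", [2, 2, 2]),
    ("example tweet c", [1, 1, 0]),
]

def majority_label(judgments):
    """Return the label chosen by the most contributors."""
    label, _count = Counter(judgments).most_common(1)[0]
    return label

labels = [majority_label(j) for _, j in rows]  # → [0, 2, 1]
```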
- TweetEval benchmark (Findings of EMNLP 2020)
- dataset
- problem
- provide a large-scale hate speech dataset
- performance (baseline)
- dataset
- problem
- models tend to be biased toward group identifiers while failing to learn from context, which leads to false positives.
- methodology
- Propose a novel regularization technique based on Sampling and Occlusion (SOC) explanations of group identifiers that encourages models to learn from the context surrounding group identifiers in addition to the identifiers themselves.
- performance
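The idea behind this kind of explanation regularization can be sketched with a toy model. Here a linear bag-of-words scorer stands in for the real classifier, and a simple occlusion score stands in for the paper's SOC explanations; the vocabulary, weights, and penalty form are illustrative assumptions, not the paper's implementation:

```python
# Toy sketch: penalize the importance a model assigns to group
# identifiers so it is pushed to rely on context instead.
VOCAB = {"muslim": 0, "attack": 1, "prayer": 2}
GROUP_IDENTIFIERS = {"muslim"}

def score(weights, tokens):
    """Hate-speech logit of a linear bag-of-words model."""
    return sum(weights[VOCAB[t]] for t in tokens if t in VOCAB)

def occlusion_importance(weights, tokens, target):
    """Change in the logit when `target` is occluded (removed)."""
    kept = [t for t in tokens if t != target]
    return score(weights, tokens) - score(weights, kept)

def regularized_loss(weights, tokens, base_loss, alpha=0.1):
    """Add a penalty on the importance of group-identifier tokens."""
    penalty = sum(
        abs(occlusion_importance(weights, tokens, t))
        for t in set(tokens) if t in GROUP_IDENTIFIERS
    )
    return base_loss + alpha * penalty
```

Minimizing this loss drives the identifier's occlusion importance toward zero, so the logit must come from context words instead.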
- dataset
- problem
- most work has focused on explicit or overt hate speech, failing to address a more pervasive form based on coded or indirect language.
- methodology
- Introduce a theoretical taxonomy of implicit hate speech.
- Provide a dataset for implicit hate speech.
- Provide several state-of-the-art baselines for detecting and explaining implicit hate speech.
- performance
- dataset
- problem
- Prior deep learning methods often relied only on pre-trained models or deeper networks to obtain semantic features, ignoring the sentiment features of the target sentences and external sentiment resources, which leaves neural network performance on hate speech detection unsatisfactory.
- methodology
- performance
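The gap described above, semantic features without sentiment features, can be illustrated with a minimal fusion sketch. The lexicons and embedding below are hypothetical stand-ins, not the paper's resources or architecture:

```python
# Sketch: fuse external sentiment-lexicon features with a semantic
# representation by simple concatenation.
NEGATIVE_LEXICON = {"hate", "stupid", "disgusting"}
POSITIVE_LEXICON = {"love", "great", "kind"}

def sentiment_features(tokens):
    """Counts of positive/negative lexicon hits as a 2-dim feature."""
    pos = sum(t in POSITIVE_LEXICON for t in tokens)
    neg = sum(t in NEGATIVE_LEXICON for t in tokens)
    return [pos, neg]

def fuse(semantic_embedding, tokens):
    """Concatenate the semantic vector with the sentiment features."""
    return list(semantic_embedding) + sentiment_features(tokens)

fused = fuse([0.1, 0.2], ["i", "hate", "this"])  # → [0.1, 0.2, 0, 1]
```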
- dataset
- problem
- Identify who the target of a given hate speech post is.
- Identify what aspects (or characteristics) are attributed to the target in the post.
- methodology
- Lone Pine at SemEval-2021 Task 5: Fine-Grained Detection of Hate Speech Using BERToxic (source code)
- problem:
- Task: Toxic Spans Detection (SemEval 2021 Task 5)
- extract the spans that contribute to a text's toxicity
- Methodology: BERT + post-processing
- Performance: the system significantly outperformed the provided baseline and achieved an F1-score of 0.683, placing Lone Pine 17th out of 91 teams in the competition.
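Toxic Spans Detection asks for character offsets, so a tagger's token-level predictions must be converted back to spans and then post-processed. A minimal sketch of that pipeline (the gap-filling rule here is an illustrative assumption, not necessarily BERToxic's actual post-processing):

```python
# Convert token-level toxicity predictions into character offsets,
# then bridge small gaps so spans come out contiguous.
def toxic_char_offsets(text, toxic_tokens):
    """Character indices covered by predicted-toxic tokens."""
    offsets, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)
        if token in toxic_tokens:
            offsets.extend(range(start, start + len(token)))
        pos = start + len(token)
    return offsets

def fill_gaps(offsets, max_gap=1):
    """Post-processing: fill gaps of up to `max_gap` characters
    (e.g. the space between two adjacent toxic words)."""
    filled = list(offsets)
    for a, b in zip(offsets, offsets[1:]):
        if 1 < b - a <= max_gap + 1:
            filled.extend(range(a + 1, b))
    return sorted(filled)

offsets = fill_gaps(toxic_char_offsets("you utter idiot", {"utter", "idiot"}))
```

On the example, the two toxic words plus the space between them yield one contiguous span, which matches how the task's gold annotations mark multi-word spans.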