This repository contains the datasets mentioned in the following papers.
- HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection (source code)
- Automated Hate Speech Detection and the Problem of Offensive Language
- Task: Contributors viewed short texts and identified whether each a) contained hate speech, b) was offensive but not hate speech, or c) was not offensive at all. Contains nearly 15K rows with three contributor judgments per text string. (3 MB)
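Since each text string carries three contributor judgments, a single label is usually derived by majority vote. A minimal sketch (the row layout and label encoding here are hypothetical, not the dataset's actual schema):

```python
from collections import Counter

# Hypothetical rows: (text, [judgment1, judgment2, judgment3])
# with labels: 0 = hate speech, 1 = offensive, 2 = neither.
rows = [
    ("example tweet a", [0, 0, 1]),
    ("example tweet b", [2, 2, 2]),
    ("example tweet c", [1, 1, 0]),
]

def majority_label(judgments):
    """Return the label chosen by the most contributors."""
    label, _count = Counter(judgments).most_common(1)[0]
    return label

labels = [majority_label(j) for _, j in rows]  # → [0, 2, 1]
```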
- TweetEval benchmark (Findings of EMNLP 2020)
- dataset
- problem
- provide a large-scale hate speech dataset
- performance (baseline)
- dataset
- problem
- models tend to be biased toward group identifiers while failing to learn from context, which leads to false positives.
- methodology
- Propose a novel regularization technique based on Sampling and Occlusion (SOC) explanations of group identifiers that encourages models to learn from the context surrounding group identifiers in addition to the identifiers themselves.
- performance
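The idea behind this kind of explanation regularization can be sketched with a toy model. Here a linear bag-of-words scorer stands in for the real classifier, and a simple occlusion score stands in for the paper's SOC explanations; the vocabulary, weights, and penalty form are illustrative assumptions, not the paper's implementation:

```python
# Toy sketch: penalize the importance a model assigns to group
# identifiers so it is pushed to rely on context instead.
VOCAB = {"muslim": 0, "attack": 1, "prayer": 2}
GROUP_IDENTIFIERS = {"muslim"}

def score(weights, tokens):
    """Hate-speech logit of a linear bag-of-words model."""
    return sum(weights[VOCAB[t]] for t in tokens if t in VOCAB)

def occlusion_importance(weights, tokens, target):
    """Change in the logit when `target` is occluded (removed)."""
    kept = [t for t in tokens if t != target]
    return score(weights, tokens) - score(weights, kept)

def regularized_loss(weights, tokens, base_loss, alpha=0.1):
    """Add a penalty on the importance of group-identifier tokens."""
    penalty = sum(
        abs(occlusion_importance(weights, tokens, t))
        for t in set(tokens) if t in GROUP_IDENTIFIERS
    )
    return base_loss + alpha * penalty
```

Minimizing this loss drives the identifier's occlusion importance toward zero, so the logit must come from context words instead.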
- dataset
- problem
- most work has focused on explicit or overt hate speech, failing to address a more pervasive form based on coded or indirect language.
- methodology
- Introduce a theoretical taxonomy of implicit hate speech.
- Provide a dataset for implicit hate speech.
- Provide several state-of-the-art baselines for detecting and explaining implicit hate speech.
- performance
- dataset
- problem
- Prior deep learning methods often relied only on pre-trained models or deeper networks to obtain semantic features, ignoring the sentiment features of the target sentences and external sentiment resources, which leaves neural network performance on hate speech detection unsatisfactory.
- methodology
- performance
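The gap described above, semantic features without sentiment features, can be illustrated with a minimal fusion sketch. The lexicons and embedding below are hypothetical stand-ins, not the paper's resources or architecture:

```python
# Sketch: fuse external sentiment-lexicon features with a semantic
# representation by simple concatenation.
NEGATIVE_LEXICON = {"hate", "stupid", "disgusting"}
POSITIVE_LEXICON = {"love", "great", "kind"}

def sentiment_features(tokens):
    """Counts of positive/negative lexicon hits as a 2-dim feature."""
    pos = sum(t in POSITIVE_LEXICON for t in tokens)
    neg = sum(t in NEGATIVE_LEXICON for t in tokens)
    return [pos, neg]

def fuse(semantic_embedding, tokens):
    """Concatenate the semantic vector with the sentiment features."""
    return list(semantic_embedding) + sentiment_features(tokens)

fused = fuse([0.1, 0.2], ["i", "hate", "this"])  # → [0.1, 0.2, 0, 1]
```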
- dataset
- problem
- Identify who the target of a given hate speech post is.
- Identify what aspects (or characteristics) are attributed to the target in the post.
- methodology
- Lone Pine at SemEval-2021 Task 5: Fine-Grained Detection of Hate Speech Using BERToxic (source code)
- problem:
- Task: Toxic Spans Detection (SemEval 2021 Task 5)
- extract the spans that contribute to a text's toxicity
- Methodology: BERT + post-processing
- Performance: the system significantly outperformed the provided baseline and achieved an F1-score of 0.683, placing Lone Pine 17th out of 91 teams in the competition.
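Toxic Spans Detection asks for character offsets, so a tagger's token-level predictions must be converted back to spans and then post-processed. A minimal sketch of that pipeline (the gap-filling rule here is an illustrative assumption, not necessarily BERToxic's actual post-processing):

```python
# Convert token-level toxicity predictions into character offsets,
# then bridge small gaps so spans come out contiguous.
def toxic_char_offsets(text, toxic_tokens):
    """Character indices covered by predicted-toxic tokens."""
    offsets, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)
        if token in toxic_tokens:
            offsets.extend(range(start, start + len(token)))
        pos = start + len(token)
    return offsets

def fill_gaps(offsets, max_gap=1):
    """Post-processing: fill gaps of up to `max_gap` characters
    (e.g. the space between two adjacent toxic words)."""
    filled = list(offsets)
    for a, b in zip(offsets, offsets[1:]):
        if 1 < b - a <= max_gap + 1:
            filled.extend(range(a + 1, b))
    return sorted(filled)

offsets = fill_gaps(toxic_char_offsets("you utter idiot", {"utter", "idiot"}))
```

On the example, the two toxic words plus the space between them yield one contiguous span, which matches how the task's gold annotations mark multi-word spans.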