A curated list of awesome datasets with human label variation (un-aggregated labels) in Natural Language Processing and Computer Vision, including links to related initiatives and key references. The tables below collect datasets that contain multiple annotations per instance, to enable learning with human label variation/disagreement. The starting point of Table 1 was the table in the appendix of our paper.
If you know of resources or papers or links that are not yet listed, please help grow this resource. You can contribute by creating a pull request as outlined in contributing.md.
Please cite our paper (Plank, 2022, EMNLP) if you find this repository useful:
@inproceedings{plank-2022-emnlp,
title = "The ``Problem'' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation",
author = "Plank, Barbara",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi",
publisher = "Association for Computational Linguistics",
}
Icons refer to the following:
- π π SemEval 2023 Shared Task 11 on Learning with Disagreement (Le-Wi-Di): 2nd Shared task on subjective NLP tasks π on-going!
- 🤷 SemEval 2021 Shared Task 11 on Learning with Disagreement: 1st Shared task, which included core NLP and computer vision tasks
- 🥧 Perspectivist Data Manifesto (PDAI): Website that contains key references and a first list of non-aggregated datasets
- 🗣️ NLPerspectives 2022: Workshop on Perspectivist Approaches to NLP, held at LREC 2022; the 2nd edition (2023) was co-located with ECAI 2023
- π Uma et al., 2021: Learning from Disagreement: A Survey. Broad overview across NLP and computer vision tasks.
- Plank et al., 2014. Learning part-of-speech taggers with inter-annotator agreement loss. Proposed to leverage small samples of un-aggregated data to improve performance on morphosyntactic NLP tasks. Inspired follow-up work such as "Linguistically debatable or just plain wrong?" (ACL 2014), an analysis of the systematicity of annotator agreement on objective linguistic annotation tasks (POS tagging).
- Aroyo & Welty, 2015. Truth is a lie: Crowd truth and the seven myths of human annotation. AI Magazine. Proposes the crowd truth framework, which included a large body of work on medical relation extraction, frame disambiguation and other semantic processing tasks.
- Pavlick & Kwiatkowski, 2019. Inherent Disagreements in Human Textual Inferences. TACL. Seminal work that illustrates plausible disagreement in entailment datasets. Inspired follow-up work, including dataset re-annotation studies such as ChaosNLI by Nie et al., 2020 and work on embracing the collective human opinion for NLI.
- Alm, 2011. Subjective Natural Language Problems: Motivations, Applications, Characterizations, and Implications. Early paper discussing annotator agreement on subjective linguistic annotation tasks.
- Basile et al., 2021. Toward a Perspectivist Turn in Ground Truthing for Predictive Computing. Conference of the Italian Chapter of the Association for Intelligent Systems (ItAIS 2021). Puts forward data perspectivism to embrace human perspectives. Inspired a lot of follow-up work on subjective tasks (see e.g. the Le-Wi-Di 2023 shared task).
- Gordon et al., 2021. The Disagreement Deconvolution: Bringing Machine Learning Performance Metrics In Line With Reality. Seminal paper at the CHI conference on human-computer interaction.
- π Davani et al., 2022. Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations. TACL. Examines, among other things, whether the uncertainty in predictions is correlated with whether the multi-task model was able to correctly predict the majority label.
- Jiang & de Marneffe, 2022. Investigating Reasons for Disagreement in Natural Language Inference. TACL. Provides a novel linguistic taxonomy to characterize disagreements in natural language inference datasets.
- πΈ Wan et al., 2023. Everyone's Voice Matters: Quantifying Annotation Disagreement Using Demographic Information. AAAI. Predicts human label variation on five subjective tasks and examines demographic information.
The list above contains selected key references. Please see our EMNLP 2022 theme paper (Plank, 2022) for further references related to annotator culture/backgrounds, different terms proposed in the literature and more. If you know of relevant related work (not datasets), please open an issue. For more datasets, please see contributing.md.
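
As a rough illustration of what "multiple annotations per instance" makes possible, the sketch below turns per-annotator labels into a soft label (an empirical distribution over classes), compares it with the aggregated majority vote, and uses entropy as a simple indicator of disagreement. The item ids, class names, and annotations are hypothetical toy data, and the helper functions are illustrative only; they are not taken from any of the listed datasets or papers.

```python
import math
from collections import Counter


def soft_label(annotations, classes):
    """Empirical label distribution over `classes` for one instance."""
    if not annotations:
        raise ValueError("an instance needs at least one annotation")
    counts = Counter(annotations)
    return [counts[c] / len(annotations) for c in classes]


def majority_label(annotations):
    """Aggregated label; ties go to the label that was seen first."""
    return Counter(annotations).most_common(1)[0][0]


def entropy(dist):
    """Shannon entropy in bits; higher values mean more disagreement."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


# Hypothetical toy data: item id -> labels from four annotators.
classes = ["entailment", "neutral", "contradiction"]
items = {
    "item-1": ["entailment", "entailment", "neutral", "entailment"],
    "item-2": ["neutral", "contradiction", "neutral", "contradiction"],
}

for item_id, labels in items.items():
    dist = soft_label(labels, classes)
    print(item_id, majority_label(labels), dist, round(entropy(dist), 3))
```

The majority vote hides the fact that the annotators of `item-2` are evenly split, while the soft label and its entropy keep that information available for training or evaluation.
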
Reference | Name or Description | URL | Used in/Listed on |
---|---|---|---|
Passonneau et al., 2010 | Word sense disambiguation (WSD) | https://anc.org/ | |
Plank et al., 2014 | Part-of-Speech (POS) tagging, 500 tweets from Lowlands and Gimpel POS | https://bitbucket.org/lowlands/costsensitive-data/ or https://zenodo.org/record/5130737 | π, 🤷 |
Derczynski et al., 2016 | Broad Named Entity Recognition (NER) Twitter dataset | https://github.com/GateNLP/broad_twitter_corpus | 🥧 |
Rodrigues et al., 2018 | NER dataset, re-annotated sample of CoNLL 2003 | http://fprodrigues.com//publications/deep-crowds/ | |
Martinez Alonso et al., 2016 | Supersense tagging | https://github.com/coastalcph/semdax | |
Berzak et al., 2016 | Dependency Parsing, WSJ-23, 4 annotators | https://people.csail.mit.edu/berzak/agreement/ | |
Peng et al., 2022 | GCDT, Mandarin Chinese discourse treebank, small subsection with double annotations | https://github.com/logan-siyao-peng/GCDT/tree/main/data | |
Bryant and Ng, 2015 | Grammatical error correction | http://www.comp.nus.edu.sg/~nlp/sw/10gec_annotations.zip | |
Poesio et al. 2019 | PD (Phrase Detectives dataset): Anaphora and Information Status Classification | https://github.com/dali-ambiguity/Phrase-Detectives-Corpus-2.1.4 | π, 🤷 |
Dumitrache et al. 2018 | Medical Relation Extraction (MRE) | https://github.com/CrowdTruth/Open-Domain-Relation-Extraction | π |
Bassignana and Plank, 2022 | CrossRE, relation extraction, small doubly-annotated subset | https://github.com/mainlp/CrossRE | |
Dumitrache et al. 2018 | Frame Disambiguation | https://github.com/CrowdTruth/FrameDisambiguation | |
Snow et al. 2008 | RTE (recognizing textual entailment; 800 hypothesis-premise pairs) collected by Dagan et al. 2005, re-annotated; includes further datasets on temporal ordering, WSD, word similarity and affective text | https://sites.google.com/site/nlpannotations/ | π |
Pavlick and Kwiatkowski 2019 | NLI (natural language inference) inherent disagreement dataset, approx. 500 RTE instances from Dagan et al. 2005 re-annotated by 50 annotators | https://github.com/epavlick/NLI-variation-data | |
Nie et al., 2020 | ChaosNLI, large NLI dataset re-annotated by 100 annotators | https://github.com/easonnie/ChaosNLI | |
Demszky et al., 2020 | GoEmotions: Reddit comments annotated for 27 emotion categories or neutral | https://github.com/google-research/google-research/tree/master/goemotions | π |
Ferracane et al., 2021 | Subjective discourse: conversation acts and intents | https://github.com/elisaF/subjective_discourse | |
Damgaard et al., 2021 | Understanding indirect answers to polar questions | https://github.com/friendsQIA/Friends_QIA | |
de Marneffe et al., 2019 | CommitmentBank: 8 annotations indicating the extent to which the speakers are committed to the truth of the embedded clause | https://github.com/mcdm/CommitmentBank | |
Kennedy et al., 2020 | Hate speech detection | https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech | 🥧, π |
Dinu et al., 2021 | Pejorative words dataset | https://nlp.unibuc.ro/resources or http://pdai.info/ | 🥧 |
Leonardelli et al., 2021 | MultiDomain Agreement, Offensive language detection on Twitter, 5 offensive/non-offensive labels; also part of Le-Wi-Di SemEval23 | https://github.com/dhfbk/annotators-agreement-dataset/ | π π, 🥧 |
Cercas Curry et al., 2021 | ConvAbuse, abusive language towards three conversational AI systems; also part of Le-Wi-Di SemEval23 | https://github.com/amandacurry/convabuse | π π, 🥧 |
Liu et al., 2019 | Work and Well-being Job-related Tweets, 5 annotators | https://github.com/Homan-Lab/pldl_data | 🥧 |
Simpson et al., 2019 | Humour: pairwise funniness judgements | https://zenodo.org/record/5130737 | 🤷 |
Akhtar et al., 2019 | HS-brexit; abusive language on Brexit, annotated for hate speech (HS), aggressiveness and offensiveness, 6 annotators; extended and new parts are part of Le-Wi-Di SemEval23 | https://le-wi-di.github.io/ | π π |
Almanea and Poesio 2022 | ArMIS; New Le-Wi-Di SemEval23 dataset on Arabic tweets annotated for misogyny detection | https://le-wi-di.github.io/ | π π |
Sap et al., 2022 | Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection | http://maartensap.com/racial-bias-hatespeech/ | |
Kumar et al., 2021 | Designing Toxic Content Classification for a Diversity of Perspectives | https://data.esrg.stanford.edu/study/toxicity-perspectives (contact author for password) | |
Nguyen et al., 2017 | Biomedical Information Retrieval, each doc is annotated by roughly 5 Amazon Mechanical Turk workers | https://github.com/yinfeiy/PICO-data | |
Zhang et al., 2022 | Chinese Sentiment Words Identification, each sentence is annotated by 3-5 workers | https://github.com/izhx/crowd-OEI | |
Grubenmann et al., 2018 | Sentiment annotations for Swiss German sentences | https://github.com/spinningbytes/SB-CH | |
Ji et al., 2022 | KiloGram tangram dataset, 10 annotations per tangram (EMNLP 2022 best long paper award) | https://github.com/lil-lab/kilogram | |
Kennedy et al., 2020 | The Gab Hate Corpus: a collection of 27k posts annotated for hate speech [#Labels: 2, #Unique Raters: 18, at least 3 annotations per instance] | https://osf.io/edua3/ | |
Haber et al., 2023 | SOA: Singapore online attacks, multilingual toxic data annotated with 3 annotators. | https://github.com/rewire-online/singapore-online-attacks/tree/main | |
Liu et al., 2022 | Word Associations with 19K explanations and 725 relation labels from 5 annotators | https://github.com/ChunhuaLiu596/WAX/ | |
Frermann et al., 2023 | Multi-label frame annotations of 428 news articles, each labeled by 2-3 annotators | https://github.com/phenixace/narrative-framing/tree/main/data | |
Sap et al., 2020 | Social Bias Frames: Reasoning about Social and Power Implications of Language (3 annotators) | https://maartensap.com/social-bias-frames/ | πΈ |
Fleisig et al., 2023 | FairPrism: Evaluating Fairness-Related Harms in Generated Text (3 annotators) | https://github.com/microsoft/FairPrism | |
Forbes et al., 2020 | Social Chemistry 101: Learning to Reason about Social and Moral Norms (up to 5 crowd annotations) | https://github.com/mbforbes/social-chemistry-101 | πΈ |
Lourie et al., 2021 | Scruples-dilemmas: A Corpus of Community Ethical Judgments (with 5 crowd annotations per instance) | https://github.com/allenai/scruples | πΈ |
Potts et al., 2021 | Dyna-Sentiment (5 crowd annotations) | https://github.com/cgpotts/dynasent | πΈ |
Danescu-Niculescu-Mizil et al. 2013 | Wikipedia Politeness (with up to 5 crowd annotations) | https://convokit.cornell.edu/documentation/wiki_politeness.html or https://github.com/minnesotanlp/Quantifying-Annotation-Disagreement | πΈ |
Madeddu et al., 2023 | DisaggregHateIt: A Disaggregated Italian Dataset of Hate Speech (1.1k tweets annotated for hate, irony, stance; between 1 and 13 annotations per instance) | https://github.com/madeddumarco/DisaggregHateIt | |
Reference | Name or Description | URL |
---|---|---|
Rodrigues et al. 2018 | LabelMe: Image classification dataset with 8 categories, re-annotated | http://fprodrigues.com//publications/deep-crowds/ |
Peterson et al., 2019 | CIFAR-10H: Image classification with 10 categories, re-annotated | http://github.com/jcpeterson/cifar-10h |
Cheplygina et al. 2018 | Medical lesion classification challenge, 6 annotators each | https://figshare.com/s/5cbbce14647b66286544 |
Wei, Zhu et al., 2022 | CIFAR-100N | http://noisylabels.com/ |
Nguyen et al. 2020 | VinDr-CXR: Object detection dataset on chest X-ray images, each training image labeled by 3 annotators | https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection/ or https://vindr.ai/datasets/cxr |
Tschirschwitz et al. 2022 | TexBiG: Instance segmentation dataset on historical layout analysis, each training image labeled by 2-4 annotators | https://zenodo.org/record/8347059 or https://www.kaggle.com/datasets/davidtschirschwitz/texbig-v2-0-train-val |