awesome-human-label-variation

A curated list of awesome datasets with human label variation (un-aggregated labels) in Natural Language Processing and Computer Vision, accompanying The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation (EMNLP 2022)

The "Problem" of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation

MIT License

A curated list of awesome datasets with human label variation (un-aggregated labels) in Natural Language Processing and Computer Vision, including links to related initiatives and key references. The tables below focus on datasets that contain multiple annotations per instance, to enable learning with human label variation/disagreement. The starting point of Table 1 was the table in the appendix of our paper.
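As a rough illustration of why un-aggregated labels matter, the sketch below contrasts a majority-vote label with a soft label (the empirical distribution over annotator votes) for a single instance. The item IDs, class names, and votes are hypothetical toy data, and the helper functions are illustrative only; none of this is taken from the paper or from any dataset listed here.

```python
import math
from collections import Counter


def soft_label(annotations, classes):
    """Turn the raw votes for one instance into an empirical label distribution."""
    if not annotations:
        raise ValueError("need at least one annotation")
    counts = Counter(annotations)
    total = len(annotations)
    return {c: counts.get(c, 0) / total for c in classes}


def majority_label(annotations):
    """Aggregate by majority vote; ties go to the label seen first."""
    return Counter(annotations).most_common(1)[0][0]


def entropy(dist):
    """Shannon entropy (in bits) of a label distribution; 0.0 means full agreement."""
    return sum(-p * math.log2(p) for p in dist.values() if p > 0)


if __name__ == "__main__":
    # Hypothetical toy votes: five annotators per item, three NLI-style classes.
    classes = ["entailment", "neutral", "contradiction"]
    votes = {
        "item-1": ["entailment"] * 5,
        "item-2": ["entailment", "neutral", "neutral", "contradiction", "neutral"],
    }
    for item_id, labels in votes.items():
        dist = soft_label(labels, classes)
        print(item_id, majority_label(labels), dist, round(entropy(dist), 3))
```

Majority voting maps both toy items to a single label, whereas the soft label and its entropy keep the information that annotators disagreed on the second item. Whether to train or evaluate on such distributions depends on the dataset and task.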

🧩 Something not listed?

If you know of resources, papers, or links that are not yet listed, please help grow this resource. You can contribute by creating a pull request as outlined in contributing.md.

🎓 Citing

Please cite our paper (Plank, EMNLP 2022) if you find this repository useful:

@inproceedings{plank-2022-emnlp,
    title = "The ``Problem'' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation",
    author = "Plank, Barbara",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi",
    publisher = "Association for Computational Linguistics",
}

Human Label Variation - Related Initiatives and further reading

Initiatives, Evaluation Campaigns and Workshops

Icons refer to the following:

Survey and Key Selected References

The list above contains selected key references. Please see our EMNLP 2022 theme paper (Plank, 2022) for further references related to annotator culture/backgrounds, different terms proposed in the literature, and more. If you know of relevant related work (not datasets), please leave an Issue. For more datasets, please see contributing.md.

📊 Datasets

NLP datasets

| Reference | Name or Description | URL | Used in/Listed on |
| --- | --- | --- | --- |
| Passonneau et al., 2010 | Word sense disambiguation (WSD) | https://anc.org/ | |
| Plank et al., 2014 | Part-of-Speech (POS) tagging, 500 tweets from Lowlands and Gimpel POS | https://bitbucket.org/lowlands/costsensitive-data/ or https://zenodo.org/record/5130737 | 🔍, 🤷 |
| Derczynski et al., 2016 | Broad Named Entity Recognition (NER) Twitter dataset | https://github.com/GateNLP/broad_twitter_corpus | 🥧 |
| Rodrigues et al., 2018 | NER dataset, re-annotated sample of CoNLL 2003 | http://fprodrigues.com//publications/deep-crowds/ | |
| Martinez Alonso et al., 2016 | Supersense tagging | https://github.com/coastalcph/semdax | |
| Berzak et al., 2016 | Dependency Parsing, WSJ-23, 4 annotators | https://people.csail.mit.edu/berzak/agreement/ | |
| Peng et al., 2022 | GCDT, Mandarin Chinese discourse treebank, small subsection with double annotations | https://github.com/logan-siyao-peng/GCDT/tree/main/data | |
| Bryant and Ng, 2015 | Grammatical error correction | http://www.comp.nus.edu.sg/~nlp/sw/10gec_annotations.zip | |
| Poesio et al., 2019 | PD (Phrase Detectives dataset): Anaphora and Information Status Classification | https://github.com/dali-ambiguity/Phrase-Detectives-Corpus-2.1.4 | 🔍, 🤷 |
| Dumitrache et al., 2018 | Medical Relation Extraction (MRE) | https://github.com/CrowdTruth/Open-Domain-Relation-Extraction | 🔍 |
| Bassignana and Plank, 2022 | CrossRE, relation extraction, small doubly-annotated subset | https://github.com/mainlp/CrossRE | |
| Dumitrache et al., 2018 | Frame Disambiguation | https://github.com/CrowdTruth/FrameDisambiguation | |
| Snow et al., 2008 | RTE (recognizing textual entailment; 800 hypothesis-premise pairs) collected by Dagan et al., 2005, re-annotated; includes further datasets on temporal ordering, WSD, word similarity and affective text | https://sites.google.com/site/nlpannotations/ | 🔍 |
| Pavlick and Kwiatkowski, 2019 | NLI (natural language inference) inherent disagreement dataset, approx. 500 RTE instances from Dagan et al., 2005 re-annotated by 50 annotators | https://github.com/epavlick/NLI-variation-data | |
| Nie et al., 2020 | ChaosNLI, large NLI dataset re-annotated by 100 annotators | https://github.com/easonnie/ChaosNLI | |
| Demszky et al., 2020 | GoEmotions: Reddit comments annotated for 27 emotion categories or neutral | https://github.com/google-research/google-research/tree/master/goemotions | 👓 |
| Ferracane et al., 2021 | Subjective discourse: conversation acts and intents | https://github.com/elisaF/subjective_discourse | |
| Damgaard et al., 2021 | Understanding indirect answers to polar questions | https://github.com/friendsQIA/Friends_QIA | |
| de Marneffe et al., 2019 | CommitmentBank: 8 annotations indicating the extent to which the speakers are committed to the truth of the embedded clause | https://github.com/mcdm/CommitmentBank | |
| Kennedy et al., 2020 | Hate speech detection | https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech | 🥧, 👓 |
| Dinu et al., 2021 | Pejorative words dataset | https://nlp.unibuc.ro/resources or http://pdai.info/ | 🥧 |
| Leonardelli et al., 2021 | MultiDomain Agreement, offensive language detection on Twitter, 5 offensive/non-offensive labels; also part of Le-Wi-Di SemEval23 | https://github.com/dhfbk/annotators-agreement-dataset/ | 👍👎, 🥧 |
| Cercas Curry et al., 2021 | ConvAbuse, abusive language towards three conversational AI systems; also part of Le-Wi-Di SemEval23 | https://github.com/amandacurry/convabuse | 👍👎, 🥧 |
| Liu et al., 2019 | Work and Well-being Job-related Tweets, 5 annotators | https://github.com/Homan-Lab/pldl_data | 🥧 |
| Simpson et al., 2019 | Humour: pairwise funniness judgements | https://zenodo.org/record/5130737 | 🤷 |
| Akhtar et al., 2019 | HS-brexit: abusive language on Brexit, annotated for hate speech (HS), aggressiveness and offensiveness, 6 annotators; extended and new parts are part of Le-Wi-Di SemEval23 | https://le-wi-di.github.io/ | 👍👎 |
| Almanea and Poesio, 2022 | ArMIS: new Le-Wi-Di SemEval23 dataset on Arabic tweets annotated for misogyny detection | https://le-wi-di.github.io/ | 👍👎 |
| Sap et al., 2022 | Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection | http://maartensap.com/racial-bias-hatespeech/ | |
| Kumar et al., 2021 | Designing Toxic Content Classification for a Diversity of Perspectives | https://data.esrg.stanford.edu/study/toxicity-perspectives (contact author for password) | |
| Nguyen et al., 2017 | Biomedical Information Retrieval, each doc is annotated by roughly 5 Amazon Mechanical Turk workers | https://github.com/yinfeiy/PICO-data | |
| Zhang et al., 2022 | Chinese Sentiment Words Identification, each sentence is annotated by 3-5 workers | https://github.com/izhx/crowd-OEI | |
| Grubenmann et al., 2018 | Sentiment annotations for Swiss German sentences | https://github.com/spinningbytes/SB-CH | |
| Ji et al., 2022 | KiloGram tangram dataset, 10 annotations per tangram (EMNLP 2022 best long paper award) | https://github.com/lil-lab/kilogram | |
| Kennedy et al., 2020 | The Gab Hate Corpus: a collection of 27k posts annotated for hate speech [#Labels: 2, #Unique Raters: 18, at least 3 annotations per instance] | https://osf.io/edua3/ | |
| Haber et al., 2023 | SOA: Singapore online attacks, multilingual toxic data annotated by 3 annotators | https://github.com/rewire-online/singapore-online-attacks/tree/main | |
| Liu et al., 2022 | Word Associations with 19K explanations and 725 relation labels from 5 annotators | https://github.com/ChunhuaLiu596/WAX/ | |
| Frermann et al., 2023 | Multi-label frame annotations of 428 news articles, each labeled by 2-3 annotators | https://github.com/phenixace/narrative-framing/tree/main/data | |
| Sap et al., 2020 | Social Bias Frames: Reasoning about Social and Power Implications of Language (3 annotators) | https://maartensap.com/social-bias-frames/ | 🔸 |
| Fleisig et al., 2023 | FairPrism: Evaluating Fairness-Related Harms in Generated Text (3 annotators) | https://github.com/microsoft/FairPrism | |
| Forbes et al., 2020 | Social Chemistry 101: Learning to Reason about Social and Moral Norms (up to 5 crowd annotations) | https://github.com/mbforbes/social-chemistry-101 | 🔸 |
| Lourie et al., 2021 | Scruples-dilemmas: A Corpus of Community Ethical Judgments (with 5 crowd annotations per instance) | https://github.com/allenai/scruples | 🔸 |
| Potts et al., 2021 | Dyna-Sentiment (5 crowd annotations) | https://github.com/cgpotts/dynasent | 🔸 |
| Danescu-Niculescu-Mizil et al., 2013 | Wikipedia Politeness (with up to 5 crowd annotations) | https://convokit.cornell.edu/documentation/wiki_politeness.html or https://github.com/minnesotanlp/Quantifying-Annotation-Disagreement | 🔸 |
| Madeddu et al., 2023 | DisaggregHateIt: A Disaggregated Italian Dataset of Hate Speech (1.1k tweets annotated for hate, irony, stance; between 1 and 13 annotations per instance) | https://github.com/madeddumarco/DisaggregHateIt | |

Computer Vision (CV) datasets

| Reference | Name or Description | URL |
| --- | --- | --- |
| Rodrigues et al., 2018 | LabelMe: Image classification dataset with 8 categories, re-annotated | http://fprodrigues.com//publications/deep-crowds/ |
| Peterson et al., 2019 | Cifar10H: Image classification with 10 categories, re-annotated | http://github.com/jcpeterson/cifar-10h |
| Cheplygina et al., 2018 | Medical lesion classification challenge, 6 annotators each | https://figshare.com/s/5cbbce14647b66286544 |
| Wei, Zhu et al., 2022 | CIFAR-100N | http://noisylabels.com/ |
| Nguyen et al., 2020 | VinDR-CXR: Object detection dataset on chest x-ray images, each training image labeled by 3 annotators | https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection/ or https://vindr.ai/datasets/cxr |
| Tschirschwitz et al., 2022 | TexBiG: Instance segmentation dataset on historical layout analysis, each training image labeled by 2-4 annotators | https://zenodo.org/record/8347059 or https://www.kaggle.com/datasets/davidtschirschwitz/texbig-v2-0-train-val |