SQuAD2-CR

Semi-supervised Annotation for Cause and Rationales for Unanswerability in SQuAD 2.0

This dataset contains two types of annotations, causes and rationales, for the unanswerable questions in the SQuAD 2.0 dataset.
For more about the SQuAD 2.0 dataset, visit the official page.

Download Dataset

SQuAD2_CR_20191127.zip

Resource Description

Causes

This part of the dataset annotates why each question is unanswerable for its given context.
Each question is classified into one of 6 classes, based on the taxonomy of the original paper with minor edits.
Detailed information for each class is given below.

Data format (tsv)

Column  Description                          Format              Example
qid     ID from SQuAD 2.0                    hash (len 24)       5a678aa6f038b7001ab0c2a0
reason  why the question cannot be answered  {E, #, N, A, X, I}  E

Reason class definitions & dataset statistics

Name              Abbr  Description                                                                                           Train  Test  Extended
Entity Swap       E     Entity replaced with another entity.                                                                  5818   1122  12597
Number Swap       #     Number or date replaced with another number or date.                                                  1642   254   3167
Negation          N     Negation word inserted or removed.                                                                    1860   506   4099
Antonym           A     An antonym of a word in the context is used in the question.                                          2818   593   7446
Mutual Exclusion  X     Word or phrase is mutually exclusive with something for which an answer is present.                   318    256   2942
No Information    I     Asks for a condition not satisfied by anything in the paragraph, or the paragraph implies no answer.  841    375   2789
Total                                                                                                                         13297  3106  33040

File description (tsv)

  • causes/reason_gold_{train,test,full}.tsv (seed data)
    • instances manually annotated by humans
    • full = train + test
  • causes/reason_extended.tsv (augmented data)
    • labels automatically propagated by semi-supervised learning
    • note: since the propagation may introduce a lot of noise, we do not
      recommend using this data for evaluation; use it only for training
      (a minimal loading sketch follows this list)
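
For concreteness, here is a minimal Python loading sketch. The two-column layout follows the format table above; whether the files carry a header row, and the exact paths, are assumptions to check against the downloaded archive.

    from collections import Counter

    REASON_NAMES = {
        "E": "Entity Swap", "#": "Number Swap", "N": "Negation",
        "A": "Antonym", "X": "Mutual Exclusion", "I": "No Information",
    }

    def load_reasons(path):
        """Read (qid, reason) pairs from a causes TSV file."""
        with open(path, encoding="utf-8") as f:
            pairs = [line.rstrip("\n").split("\t") for line in f]
        if pairs and pairs[0][0] == "qid":  # tolerate an optional header row
            pairs = pairs[1:]
        return pairs

    # e.g. reproduce the per-class counts in the statistics table
    pairs = load_reasons("causes/reason_gold_train.tsv")
    print(Counter(REASON_NAMES[r] for _, r in pairs))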

Rationales

This part of the dataset contains word-level scores for each question, indicating how much each word contributes to the unanswerability of the question. Note that not all of these values are manually labeled by humans, so some noise may remain even though we performed a manual check. (A parsing sketch follows the file description below.)

Data format (tsv)

Column    Description              Format                         Example
qid       ID from SQuAD 2.0        hash (len 24)                  5ad3ed86604f3c001a3ff7b3
question  question for given qid   tokens separated with spaces   What royalty has n't attended Yale ?
word_att  attention for each word  numbers separated with commas  0,0,0,1,0,0,0 (human-labeled)
                                                                  0.008,0.029,0.108,0.997,0.012,0.006,0.0 (extended)

File description (tsv)

  • rationales/word_att_gold_{train,test}.tsv (seed data)
    • instances automatically generated by the rule (see How we created the dataset),
      plus instances manually annotated by humans
    • each attention is a binary value (0 or 1)
  • rationales/word_att_extended.tsv (augmented data, unanswerable questions only)
    • generated by semi-supervised learning from the seed data
    • each attention is a real value (range: 0.0 to 1.0)
  • data statistics (train/test/distant): 21663/3360/49443
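
A minimal parsing sketch under the same caveats as the causes loader (three-column layout as in the format table above; header handling and paths are assumptions):

    def load_word_attention(path):
        """Yield (qid, tokens, scores) with one attention score per token."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                qid, question, word_att = line.rstrip("\n").split("\t")
                if qid == "qid":  # tolerate an optional header row
                    continue
                tokens = question.split(" ")
                scores = [float(x) for x in word_att.split(",")]
                assert len(tokens) == len(scores)
                yield qid, tokens, scores

    # e.g. print the highest-scoring (most unanswerability-related) word
    for qid, tokens, scores in load_word_attention("rationales/word_att_gold_test.tsv"):
        print(qid, max(zip(scores, tokens))[1])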

How we created the dataset

To generate seed data for word attention, we extract pairs of answerable and unanswerable questions from the SQuAD 2.0 dataset that share a common context and answer span. Words common to both questions in a pair are labeled 0, since they tend to be unimportant for determining the answerability of the question; all other words are labeled 1.
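As an illustration, a small sketch of this rule; the set-intersection formulation is our reading of the description above, not necessarily the exact implementation:

    def seed_word_attention(answerable_q, unanswerable_q):
        """Binary word attention for the unanswerable question of a pair."""
        common = set(answerable_q.split(" ")) & set(unanswerable_q.split(" "))
        return [0 if tok in common else 1 for tok in unanswerable_q.split(" ")]

    print(seed_word_attention(
        "What royalty has attended Yale ?",
        "What royalty has n't attended Yale ?"))
    # -> [0, 0, 0, 1, 0, 0, 0], matching the human-label example above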

To extend the annotation beyond the human-labeled data, we apply tri-training (a proxy-label approach) to propagate existing annotations to unlabeled instances.
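
For readers unfamiliar with tri-training, a rough sketch of the general scheme: three models are trained on bootstrap samples of the seed data, and an unlabeled instance is added to one model's training set whenever the other two agree on its label. The scikit-learn classifier and the fixed round count below are illustrative assumptions, not the models actually used to build this dataset.

    import numpy as np
    from sklearn.base import clone
    from sklearn.linear_model import LogisticRegression
    from sklearn.utils import resample

    def tri_train(X_lab, y_lab, X_unlab, rounds=5):
        # three initial models, each fit on a bootstrap sample of the seed data
        models = [LogisticRegression(max_iter=1000).fit(
                      *resample(X_lab, y_lab, random_state=i))
                  for i in range(3)]
        for _ in range(rounds):
            preds = [m.predict(X_unlab) for m in models]
            for i in range(3):
                j, k = [t for t in range(3) if t != i]
                agree = preds[j] == preds[k]  # the other two models agree
                if not agree.any():
                    continue
                # retrain model i on seed data plus the agreed proxy labels
                X_new = np.vstack([X_lab, X_unlab[agree]])
                y_new = np.concatenate([y_lab, preds[j][agree]])
                models[i] = clone(models[i]).fit(X_new, y_new)
        return models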

License

MIT License