SQuAD2-CR

Semi-supervised Annotation for Cause and Rationales for Unanswerability in SQuAD 2.0

This dataset contains two types of annotations, causes and rationales, for the unanswerable questions in the SQuAD 2.0 dataset.
For more about the SQuAD 2.0 dataset, visit the official page.

Download Dataset

SQuAD2_CR_20191127.zip

Resource Description

Causes

This part of the dataset annotates why each question is unanswerable for its given context.
Each question is classified into one of 6 classes, based on the taxonomy of the original paper with minor edits.
Detailed information for each class is given below.

Data format (tsv)

Column  Description                          Format              Example
qid     ID from SQuAD 2.0                    hash (len 24)       5a678aa6f038b7001ab0c2a0
reason  why the question cannot be answered  {E, #, N, A, X, I}  E

Reason class definitions & dataset statistics

Name              Abbr  Description                                                                                           Train  Test  Extended
Entity Swap       E     Entity replaced with another entity.                                                                  5818   1122  12597
Number Swap       #     Number or date replaced with another number or date.                                                  1642   254   3167
Negation          N     Negation word inserted or removed.                                                                    1860   506   4099
Antonym           A     An antonym of a word in the context is used in the question.                                          2818   593   7446
Mutual Exclusion  X     Word or phrase is mutually exclusive with something for which an answer is present.                   318    256   2942
No Information    I     Asks for a condition not satisfied by anything in the paragraph, or the paragraph implies no answer.  841    375   2789
Total                                                                                                                         13297  3106  33040

File description (tsv)

  • causes/reason_gold_{train,test,full}.tsv (seed data)
    • instances manually annotated by humans
    • full = train + test
  • causes/reason_extended.tsv (augmented data)
    • labels automatically propagated by semi-supervised learning
    • note: since the propagation may introduce a lot of noise, we do not
      recommend using this data for evaluation; use it only for training
      (a minimal loading sketch follows this list)
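
For concreteness, here is a minimal Python loading sketch. The two-column layout follows the format table above; whether the files carry a header row, and the exact paths, are assumptions to check against the downloaded archive.

    from collections import Counter

    REASON_NAMES = {
        "E": "Entity Swap", "#": "Number Swap", "N": "Negation",
        "A": "Antonym", "X": "Mutual Exclusion", "I": "No Information",
    }

    def load_reasons(path):
        """Read (qid, reason) pairs from a causes TSV file."""
        with open(path, encoding="utf-8") as f:
            pairs = [line.rstrip("\n").split("\t") for line in f]
        if pairs and pairs[0][0] == "qid":  # tolerate an optional header row
            pairs = pairs[1:]
        return pairs

    # e.g. reproduce the per-class counts in the statistics table
    pairs = load_reasons("causes/reason_gold_train.tsv")
    print(Counter(REASON_NAMES[r] for _, r in pairs))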

Rationales

This part of the dataset contains word-level scores for each question, indicating how much each word contributes to the unanswerability of the question. Note that not all of these values are manually labeled by humans, so some noise may remain even though we performed a manual check. (A parsing sketch follows the file description below.)

Data format (tsv)

Column    Description              Format                         Example
qid       ID from SQuAD 2.0        hash (len 24)                  5ad3ed86604f3c001a3ff7b3
question  question for given qid   tokens separated with spaces   What royalty has n't attended Yale ?
word_att  attention for each word  numbers separated with commas  0,0,0,1,0,0,0 (human-labeled)
                                                                  0.008,0.029,0.108,0.997,0.012,0.006,0.0 (extended)

File description (tsv)

  • rationales/word_att_gold_{train,test}.tsv (seed data)
    • instances automatically generated by the rule (see How we created the dataset),
      plus instances manually annotated by humans
    • each attention is a binary value (0 or 1)
  • rationales/word_att_extended.tsv (augmented data, unanswerable questions only)
    • generated by semi-supervised learning from the seed data
    • each attention is a real value (range: 0.0 to 1.0)
  • data statistics (train/test/distant): 21663/3360/49443
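
A minimal parsing sketch under the same caveats as the causes loader (three-column layout as in the format table above; header handling and paths are assumptions):

    def load_word_attention(path):
        """Yield (qid, tokens, scores) with one attention score per token."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                qid, question, word_att = line.rstrip("\n").split("\t")
                if qid == "qid":  # tolerate an optional header row
                    continue
                tokens = question.split(" ")
                scores = [float(x) for x in word_att.split(",")]
                assert len(tokens) == len(scores)
                yield qid, tokens, scores

    # e.g. print the highest-scoring (most unanswerability-related) word
    for qid, tokens, scores in load_word_attention("rationales/word_att_gold_test.tsv"):
        print(qid, max(zip(scores, tokens))[1])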

How we created the dataset

To generate seed data for word attention, we extract pairs of answerable and unanswerable questions from the SQuAD 2.0 dataset that share a common context and answer span. Words common to both questions in a pair are labeled 0, since they tend to be unimportant for determining the answerability of the question; all other words are labeled 1.
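As an illustration, a small sketch of this rule; the set-intersection formulation is our reading of the description above, not necessarily the exact implementation:

    def seed_word_attention(answerable_q, unanswerable_q):
        """Binary word attention for the unanswerable question of a pair."""
        common = set(answerable_q.split(" ")) & set(unanswerable_q.split(" "))
        return [0 if tok in common else 1 for tok in unanswerable_q.split(" ")]

    print(seed_word_attention(
        "What royalty has attended Yale ?",
        "What royalty has n't attended Yale ?"))
    # -> [0, 0, 0, 1, 0, 0, 0], matching the human-label example above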

To extend the annotation beyond the human-labeled data, we apply tri-training (a proxy-label approach) to propagate existing annotations to unlabeled instances.
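
For readers unfamiliar with tri-training, a rough sketch of the general scheme: three models are trained on bootstrap samples of the seed data, and an unlabeled instance is added to one model's training set whenever the other two agree on its label. The scikit-learn classifier and the fixed round count below are illustrative assumptions, not the models actually used to build this dataset.

    import numpy as np
    from sklearn.base import clone
    from sklearn.linear_model import LogisticRegression
    from sklearn.utils import resample

    def tri_train(X_lab, y_lab, X_unlab, rounds=5):
        # three initial models, each fit on a bootstrap sample of the seed data
        models = [LogisticRegression(max_iter=1000).fit(
                      *resample(X_lab, y_lab, random_state=i))
                  for i in range(3)]
        for _ in range(rounds):
            preds = [m.predict(X_unlab) for m in models]
            for i in range(3):
                j, k = [t for t in range(3) if t != i]
                agree = preds[j] == preds[k]  # the other two models agree
                if not agree.any():
                    continue
                # retrain model i on seed data plus the agreed proxy labels
                X_new = np.vstack([X_lab, X_unlab[agree]])
                y_new = np.concatenate([y_lab, preds[j][agree]])
                models[i] = clone(models[i]).fit(X_new, y_new)
        return models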

License

MIT License