Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension

C3

Overview

This repository maintains C3, the first free-form multiple-Choice Chinese machine reading Comprehension dataset.

@article{sun2019investigating,
  title={Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension},
  author={Sun, Kai and Yu, Dian and Yu, Dong and Cardie, Claire},
  journal={Transactions of the Association for Computational Linguistics},
  year={2020},
  url={https://arxiv.org/abs/1904.09679v3}
}

Files in this repository:

  • license.txt: the license of C3.
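  • data/c3-{m,d}-{train,dev,test}.json: the dataset files, where m and d represent "mixed-genre" and "dialogue", respectively. Each file is a JSON array with one entry per document; each entry holds the document, its questions, and the document id, in the following format: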
[
  [
    [
      document 1
    ],
    [
      {
        "question": document 1 / question 1,
        "choice": [
          document 1 / question 1 / answer option 1,
          document 1 / question 1 / answer option 2,
          ...
        ],
        "answer": document 1 / question 1 / correct answer option
      },
      {
        "question": document 1 / question 2,
        "choice": [
          document 1 / question 2 / answer option 1,
          document 1 / question 2 / answer option 2,
          ...
        ],
        "answer": document 1 / question 2 / correct answer option
      },
      ...
    ],
    document 1 / id
  ],
  [
    [
      document 2
    ],
    [
      {
        "question": document 2 / question 1,
        "choice": [
          document 2 / question 1 / answer option 1,
          document 2 / question 1 / answer option 2,
          ...
        ],
        "answer": document 2 / question 1 / correct answer option
      },
      {
        "question": document 2 / question 2,
        "choice": [
          document 2 / question 2 / answer option 1,
          document 2 / question 2 / answer option 2,
          ...
        ],
        "answer": document 2 / question 2 / correct answer option
      },
      ...
    ],
    document 2 / id
  ],
  ...
]
  • annotation/c3-{m,d}-{dev,test}.txt: question type annotations. Each file contains 150 annotated instances. We adopt the following abbreviations:
    Category              Abbreviation   Question Type
    Matching              m              Matching
    Prior knowledge       l              Linguistic
                          s              Domain-specific
                          c-a            Arithmetic
                          c-o            Connotation
                          c-e            Cause-effect
                          c-i            Implication
                          c-p            Part-whole
                          c-d            Precondition
                          c-h            Scenario
                          c-n            Other
    Supporting sentences  0              Single sentence
                          1              Multiple sentences
                          2              Independent
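
The nested data format described above can be flattened into one record per question. The following is a minimal sketch, not part of the repository: the names `Example`, `iter_examples`, and `load_c3` are hypothetical, and it assumes that the first element of each entry is a list of one or more text segments (joined here with newlines) and that "answer" holds the text of the correct option.

```python
import json
from typing import Any, Iterator, List, NamedTuple


class Example(NamedTuple):
    """One question paired with its document (hypothetical helper type)."""
    doc_id: Any          # the third element of each entry; type not assumed
    document: str
    question: str
    choices: List[str]
    answer: str


def iter_examples(data) -> Iterator[Example]:
    """Yield one Example per question from the parsed JSON structure.

    Each entry is expected to look like [[document], [questions], id].
    """
    for doc_parts, questions, doc_id in data:
        # Assumption: the document is stored as a list of text segments.
        document = "\n".join(doc_parts)
        for q in questions:
            yield Example(doc_id, document, q["question"], q["choice"], q["answer"])


def load_c3(path: str) -> List[Example]:
    """Read one dataset file, e.g. data/c3-m-train.json, into a flat list."""
    with open(path, encoding="utf-8") as f:
        return list(iter_examples(json.load(f)))
```

For instance, `load_c3("data/c3-d-dev.json")` would return the flattened questions of the dialogue development set. Documents without questions simply contribute no records.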