
An inventory of data sets around Question Generation and Question Answering

Question Generation and Question Answering Data Sets

The following is an inventory of data sets for two areas of Natural Language Processing (NLP): Natural Language Generation (NLG) / Question Generation (QG) and Natural Language Understanding (NLU) / Question Answering (QA). QA is included in this repository because the two tasks often occur together. A corpus listed with a dash ('-') as its type is not strictly a QG/NLG or QA/NLU corpus but has been mentioned in a related publication.

Data Sets

| Type | Name | Link |
| --- | --- | --- |
| QA | SQuAD2.0 - The Stanford Question Answering Dataset | https://rajpurkar.github.io/SQuAD-explorer/ |
| QA | Question-Answer Dataset | http://www.cs.cmu.edu/~ark/QA-data/ |
| QA | A Corpus for Complex Question Answering over Knowledge Graphs | http://sda.cs.uni-bonn.de/projects/qa-dataset/ |
| QA | WebQuestions | https://nlp.stanford.edu/software/sempre/ |
| QG | Question Generation Shared Task & Evaluation Challenge (QGSTEC) 2010 - Generating Questions from Sentences | https://github.com/bjwyse/QGSTEC2010 |
| QA | RecipeQA - A Dataset for Multimodal Comprehension of Cooking Recipes | https://hucvl.github.io/recipeqa/ |
| - | Cornell Movie--Dialogs Corpus | https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html |
| - | Ubuntu Dialogue Corpus v2.0 | https://github.com/rkadlec/ubuntu-ranking-dataset-creator |
| - | OSU Twitter NLP Tools | https://github.com/aritter/twitter_nlp |
| NLG | WeatherGov | https://cs.stanford.edu/~pliang/data/weather-data.zip |
| NLG | Boxscore-Data | https://github.com/harvardnlp/boxscore-data |
| NLG | WebNLG 2017 Challenge Data | http://webnlg.loria.fr/pages/challenge.html |
| NLG | Wikipedia-biography-dataset | https://github.com/DavidGrangier/wikipedia-biography-dataset |
| NLG | RNNLG | https://github.com/shawnwun/RNNLG |
| NLG | ACL-Overview | https://aclweb.org/aclwiki/Data_sets_for_NLG |