Datasets for Natural Language Processing

This is a list of datasets/corpora for NLP tasks, grouped by area and listed in reverse chronological order within each area. Suggestions and pull requests are welcome. The goal is a collaboratively maintained, up-to-date list of quality datasets.

Areas

Question Answering

  • (MS MARCO) MS MARCO: A Human Generated MAchine Reading COmprehension Dataset, 2016 [paper] [data]
  • (NewsQA) NewsQA: A Machine Comprehension Dataset, 2016 [paper] [data]
  • (SQuAD) SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016 [paper] [data]
  • (GraphQuestions) On Generating Characteristic-rich Question Sets for QA Evaluation, 2016 [paper] [data]
  • (Story Cloze) A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories, 2016 [paper] [data]
  • (Children's Book Test) The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations, 2015 [paper] [data]
  • (SimpleQuestions) Large-scale Simple Question Answering with Memory Networks, 2015 [paper] [data]
  • (WikiQA) WikiQA: A Challenge Dataset for Open-Domain Question Answering, 2015 [paper] [data]
  • (CNN-DailyMail) Teaching Machines to Read and Comprehend, 2015 [paper] [code to generate] [data]
  • (QuizBowl) A Neural Network for Factoid Question Answering over Paragraphs, 2014 [paper] [data]
  • (MCTest) MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text, 2013 [paper] [data] [alternate data link]
  • (QASent) What is the Jeopardy Model? A Quasi-Synchronous Grammar for QA, 2007 [paper] [data]

Dialogue Systems

  • (Ubuntu Dialogue Corpus) The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems, 2015 [paper] [data]

Goal-Oriented Dialogue Systems

  • (Frames) Frames: A Corpus for Adding Memory to Goal-Oriented Dialogue Systems, 2016 [paper] [data]
  • (DSTC 2 & 3) Dialog State Tracking Challenge 2 & 3, 2013 [paper] [data]