Associated paper:
https://arxiv.org/abs/1704.05179
Here are raw, split, and processed files: https://drive.google.com/drive/u/2/folders/1kBkQGooNyG0h8waaOJpgdGtOnlb1S649
One can collect the original json files through web search using the scripts in qacrawler. Please refer to the README in the folder for further details on how to use the scraper. Furthermore, one can use the files in the test folder to try it. The above link also contains the original json files that are collected using the Jeopardy! dataset.
There are also stat files that gives the number of snippets found for the question associated to its filename. This number can range from 0 to 100. For some questions the crawler is set to collect the first 50 snippets and for some it was 100. When the search doesn't give enough results to reach this level then the ones available are collected. During the training we ignored all the files that contain 40 or less snippets to eliminate possible trivial cases. Also, the training data ignores snippets from the 51st onward.
And here is the link for the Jeopardy! files themselves:
https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/
NOTE: We will release the the script that converts these to the training files above with appropriate restrictions.
Some requirements:
nltk==3.2.1
pandas==0.18.1
selenium==2.53.6
pytest==3.0.2
pytorch==0.1.11