google-deepmind/rc-data

Missing URLs after multiple trials

zcyang opened this issue · 3 comments

Hi,

I followed the instructions to download the dataset, but dozens of URLs are still missing even after running the following command multiple times:

python generate_questions.py --corpus=[cnn/dailymail] --mode=download

Is there a way to retrieve the missing ones?

I encountered the same problem yesterday. Has anyone solved it yet?

I have verified the problem: for CNN, 10 question/answer pairs are missing from the training set and 10 are missing from the test set in my case.

Changing allow_redirects=False to allow_redirects=True in two places in generate_questions.py allows it to download newer revisions of Wayback Machine documents. It appears to work fine and generates correct test sets, so I'll look into making it a permanent change.
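
For anyone who wants to see what that flag does in isolation, here is a minimal sketch. It assumes the pages are fetched with the `requests` library; `fetch_snapshot`, its parameters, and the timeout value are hypothetical names chosen for illustration and are not taken from generate_questions.py.

```python
def fetch_snapshot(url, session=None, timeout=10):
    """Fetch a page and follow any HTTP redirects.

    Hypothetical helper for illustration only; it is not part of
    generate_questions.py. Returns the body text, or None when the
    final response is not a 200.
    """
    if session is None:
        # Assumption: the real script uses the third-party `requests` library.
        import requests
        session = requests.Session()

    # With allow_redirects=True, a 3xx response pointing at a newer
    # archived revision is followed instead of being returned as-is.
    response = session.get(url, allow_redirects=True, timeout=timeout)
    if response.status_code != 200:
        return None
    return response.text
```

With allow_redirects=False, the same call would hand back the 3xx response itself, which the status check above would treat as a miss. That may be why some URLs stayed missing no matter how many times the download was repeated.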

See updated readme for downloading a processed version of the dataset.

Best, Lasse