Missing URLs after multiple attempts
zcyang opened this issue · 3 comments
zcyang commented
Hi,
I followed the instructions to download the data set, but dozens of URLs are still missing even after running
python generate_questions.py --corpus=[cnn/dailymail] --mode=download
multiple times.
zhzou2020 commented
I encountered the same problem yesterday. Has anyone solved it yet?
lespeholt commented
I have verified the problem: for CNN, 10 question/answer pairs are missing from the training set and 10 are missing from the test set in my case.
Changing allow_redirects=False to allow_redirects=True in two places in generate_questions.py allows it to download newer revisions of Wayback Machine documents. It appears to work fine and generates correct test sets, so I'll look into making it a permanent change.
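For anyone who wants to see what the change amounts to, here is a minimal sketch. It assumes the download goes through the requests library (the allow_redirects keyword suggests so); download_url, get and timeout are hypothetical names for illustration and are not taken from generate_questions.py. With allow_redirects=False, requests hands back a 3xx response as-is instead of following it, so a snapshot that the Wayback Machine redirects to a newer revision never gets fetched.

```python
def download_url(url, get=None, timeout=30):
    """Fetch one archived page and return its body as text.

    Hypothetical helper, not the actual code in generate_questions.py.
    `get` defaults to requests.get and can be replaced by a stub for testing.
    """
    if get is None:
        import requests  # third-party dependency, imported only when needed
        get = requests.get

    # allow_redirects=True lets the request follow a redirect from the
    # requested snapshot to whichever revision the archive points to.
    response = get(url, allow_redirects=True, timeout=timeout)

    # Raise on 4xx/5xx so a failed download is not mistaken for a page.
    response.raise_for_status()
    return response.text
```

If the script treats any non-200 status as a missing URL, following redirects in this way would likely explain why the previously missing entries start to download.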
lespeholt commented
See the updated README for a link to download a processed version of the dataset.
Best, Lasse