URL invalid for some datasets:
shmsw25 opened this issue · 6 comments
Hi, thank you for such a great paper and resources! I wanted to report that downloading some of the datasets using the scripts in tasks/ does not work, presumably because the dataset URLs were invalidated by the original hosts. Here is the list of datasets that gave errors due to invalid URLs:
- jeopardy
- kilt_wow
- definite_pronoun_resolution
- wiki_auto
Thank you for raising this! Bill (@yuchenlin) and I will try to find some workaround.
Hi Sewon,
I'm trying to reproduce this issue but my scripts are working as expected. Could you please provide some extra information for us? Thank you.
- What are the error messages you're getting?
- Could you double-check that your Hugging Face datasets version is 1.4.0, and then try the scripts again after clearing the cache?
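For reference, clearing the cache could look roughly like the sketch below. It assumes the cache lives in the directory named by the HF_DATASETS_CACHE environment variable, or in ~/.cache/huggingface/datasets if that variable is unset; your setup may differ, so please check the path before deleting anything. The helper names `datasets_cache_dir` and `clear_cache` are made up for illustration and are not part of the library.

```python
import os
import shutil
from pathlib import Path


def datasets_cache_dir() -> Path:
    """Guess the datasets cache location (an assumption; verify on your machine)."""
    default = Path.home() / ".cache" / "huggingface" / "datasets"
    return Path(os.environ.get("HF_DATASETS_CACHE", default))


def clear_cache(cache_dir: Path) -> bool:
    """Delete cache_dir recursively. Returns True if something was removed."""
    if cache_dir.is_dir():
        shutil.rmtree(cache_dir)
        return True
    return False


# Example (destructive, so left commented out):
# clear_cache(datasets_cache_dir())
```

After this, the next run of the scripts should download everything from scratch instead of reusing cached files.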
Hi @cherry979988, thank you for your help. Yes, I double-checked that the HF datasets version is 1.4.0, and the error keeps occurring after clearing the cache. Error messages are saved here.
P.S. I think if you have downloaded the data once, the data is saved as a cache. Perhaps that is why you were not able to reproduce the error?
Hi @shmsw25
Thank you for providing the logs. I am able to reproduce the errors.
My guess is that the dataset owners updated their files and the checksums in HF datasets have not been updated yet, so we're getting this checksum error.
A temporary solution is to pass ignore_verifications=True when loading datasets (e.g., dataset = load_dataset("kilt_tasks", "wow", ignore_verifications=True)). However, this will probably lead to differences in few-shot sampling. I'll discuss with Bill and see if there is a better solution...
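In case it helps, here is a minimal sketch of the workaround as a small wrapper. It assumes HF datasets 1.4.0, where load_dataset accepts ignore_verifications; other releases may name or handle this option differently. The wrapper `load_skipping_checksums` and its `loader` parameter are hypothetical names used only for illustration.

```python
def load_skipping_checksums(path, name=None, loader=None, **kwargs):
    """Load a dataset with checksum verification turned off.

    `loader` defaults to datasets.load_dataset and is only a parameter so the
    wrapper can be tried out without network access.
    """
    if loader is None:
        # Imported lazily; requires the `datasets` package to be installed.
        from datasets import load_dataset as loader
    # Skip the checksum check that currently fails for the affected datasets.
    return loader(path, name, ignore_verifications=True, **kwargs)


# Example (needs network access and the `datasets` package):
# dataset = load_skipping_checksums("kilt_tasks", "wow")
```

Note that skipping verification means the downloaded files may differ from the ones the recorded checksums refer to, which is why the few-shot samples may not match.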
Got it, thank you for taking a look at this!
@cherry979988 Would you mind sharing your cached copies of the following datasets, since they can no longer be downloaded over the network?
- jeopardy
- kilt_wow
- definite_pronoun_resolution
- wiki_auto