URL invalid for some datasets:

Question

shmsw25 opened this issue 3 years ago · 6 comments

Hi, thank you for such a great paper & resources! I just wanted to report that downloading some of the datasets using the scripts in tasks/ does not work, presumably because the dataset URLs were invalidated by the original hosts. Here is the list of datasets that gave errors due to invalid URLs:

jeopardy
kilt_wow
definite_pronoun_resolution
wiki_auto

Answer 1 · 2021-08-19T19:03:15.000Z

Thank you for raising this! Bill (@yuchenlin) and I will try to find some workaround.

Answer 2 · 2021-08-23T18:52:28.000Z

Hi Sewon,

I'm trying to reproduce this issue but my scripts are working as expected. Could you please provide some extra information for us? Thank you.

What are the error messages you're getting?
Could you double-check that your Hugging Face datasets library is at version 1.4.0, and then try the scripts again after clearing the cache?

Attaching my logs for reference.

Answer 3 · 2021-08-29T10:45:16.000Z

Hi @cherry979988, thank you for your help. Yes, I double-checked that the HF datasets version is 1.4.0, and the error keeps occurring after clearing the cache. Error messages are saved here.

P.S. I think that once you have downloaded the data, it is saved in the cache and reused. Perhaps that is why you were not able to reproduce the error?
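For anyone trying to check whether a cached copy is masking the download error, here is a minimal sketch that locates the datasets cache and lists what is in it. It assumes the cache lives at the HF_DATASETS_CACHE environment variable if set, and otherwise at ~/.cache/huggingface/datasets; the real library has more lookup rules, so treat this as an approximation. The helper names datasets_cache_dir and cached_entries are hypothetical, not part of the library.

```python
import os
from pathlib import Path


def datasets_cache_dir(env=None):
    """Guess where HF datasets keeps its cache (simplified assumption)."""
    env = os.environ if env is None else env
    override = env.get("HF_DATASETS_CACHE")
    if override:
        # An explicit override takes precedence over the default location.
        return Path(override).expanduser()
    # Assumed default location; the library also consults other variables.
    return Path("~/.cache/huggingface/datasets").expanduser()


def cached_entries(cache_dir):
    """Return the sorted names of sub-directories in the cache, if any."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    return sorted(p.name for p in cache_dir.iterdir() if p.is_dir())


if __name__ == "__main__":
    cache = datasets_cache_dir()
    print("cache directory:", cache)
    print("cached entries:", cached_entries(cache))
```

If one of the failing datasets shows up in the listing, the scripts are probably reading the cached copy rather than hitting the URL again.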

Answer 4 · 2021-08-30T20:13:28.000Z

Hi @shmsw25

Thank you for providing the logs. I am able to reproduce the errors.

My guess is that the dataset owners updated their files and the checksums in HF datasets have not been updated yet, so we're getting this checksum error.

A temporary solution is to pass ignore_verifications=True when loading datasets (e.g., dataset = load_dataset("kilt_tasks", "wow", ignore_verifications=True)). However, this will probably lead to differences in few-shot sampling. I'll discuss with Bill and see if there is a better solution...
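A minimal sketch of how that workaround could be wrapped, so that verification is skipped only when the checksum check actually fails. The wrapper name load_with_checksum_fallback and its loader parameter are hypothetical, added for illustration. It also assumes the failure surfaces as an exception class named NonMatchingChecksumError, which is matched by name here to avoid depending on its import path. Note that newer releases of the library replaced ignore_verifications with a verification_mode argument, so this may only apply to older versions such as the 1.4.0 discussed in this thread.

```python
def load_with_checksum_fallback(path, name=None, loader=None):
    """Try a verified load first; retry without verification on checksum errors.

    `loader` defaults to datasets.load_dataset and can be swapped for testing.
    """
    if loader is None:
        # Imported lazily so the module can be imported without the library.
        from datasets import load_dataset as loader
    try:
        return loader(path, name)
    except Exception as err:
        # Assumption: the checksum failure is a NonMatchingChecksumError.
        if type(err).__name__ != "NonMatchingChecksumError":
            raise
        print(f"Checksum mismatch for {path}/{name}; retrying without verification.")
        # The files may differ from what the checksums were recorded against,
        # so few-shot samples drawn from them may differ as well.
        return loader(path, name, ignore_verifications=True)
```

Usage would look like dataset = load_with_checksum_fallback("kilt_tasks", "wow"). Errors other than the checksum mismatch are re-raised unchanged, so a genuinely dead URL still fails loudly.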

Answer 5 · 2021-09-12T03:11:38.000Z

Got it, thank you for taking a look at this!

Answer 6 · 2021-11-12T05:00:43.000Z

@cherry979988 Would you mind sharing your cached copies of the following datasets, since they are unavailable over the network?

jeopardy
kilt_wow
definite_pronoun_resolution
wiki_auto