data loading timing and disk use
poedator opened this issue · 0 comments
poedator commented
The dataset loading code takes too long: it downloads entire huge datasets (~70 GB for wiki, etc.) just to use a handful of examples. Setting `split="train[0:2000]"` does not help, since slicing happens only after the full download.
Suggestions:
- download only the first shard(s) of each dataset instead of the full set of files.
- replace c4 with `allenai/c4`:
  ```python
  load_dataset("allenai/c4", "allenai--c4", data_files={"train": "en/c4-train.00000-of-01024.json.gz"}, split="train")
  ```
- replace wiki with wikitext2:
  ```python
  load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
  ```