Infini-AI-Lab/Sequoia

data loading timing and disk use

poedator opened this issue · 0 comments

The dataset loading code takes too long. It downloads entire huge datasets (the ~70 GB wiki dump, etc.) only to use a handful of examples. Setting split="train[0:2000]" does not help, since slicing happens only after the full download.
Suggestions:

  • Download just the first files of each dataset instead of the full corpus.
  • Replace c4 with allenai/c4, loading a single shard: load_dataset("allenai/c4", "allenai--c4", data_files={"train": "en/c4-train.00000-of-01024.json.gz"}, split="train")
  • Replace wiki with wikitext-2: load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
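A minimal sketch of the two suggested replacements, assuming the HuggingFace `datasets` library; the argument dicts and the helper `load_small_datasets` are illustrative names, not code from this repo. Keeping the arguments in plain dicts makes it easy to swap shards later:

```python
# Suggested lightweight loads: one C4 shard (~300 MB compressed) instead of
# the full corpus, and wikitext-2 (a few MB) instead of the full wiki dump.
# Names below (C4_ARGS, WIKITEXT_ARGS, load_small_datasets) are hypothetical.
C4_ARGS = dict(
    path="allenai/c4",
    name="allenai--c4",
    data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
    split="train",
)
WIKITEXT_ARGS = dict(
    path="wikitext",
    name="wikitext-2-raw-v1",
    split="train",
)

def load_small_datasets():
    # Deferred import: requires `pip install datasets`; calling this
    # triggers the (small) downloads.
    from datasets import load_dataset
    c4 = load_dataset(**C4_ARGS)
    wiki = load_dataset(**WIKITEXT_ARGS)
    return c4, wiki
```

With this, taking the first 2000 examples is cheap because only one shard (or a tiny dataset) is ever fetched.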