Cannot load the c4 dataset

Question

Closed this issue 7 months ago · 2 comments

Hello,
I have tried many things to load the c4 dataset, but I keep getting new errors. I already ran pip install -U datasets and pip install -U transformers, which did not help. Everything else I tried is listed below, step by step.

I get the following message:

ValueError: BuilderConfig 'allenai--c4' not found. Available: ['en', 'en.noblocklist', 'en.noclean', 'realnewslike', 'multilingual', 'af', 'am', 'ar', 'az', 'be', 'bg', 'bg-Latn', 'bn', 'ca', 'ceb', 'co', 'cs', 'cy', 'da', 'de', 'el', 'el-Latn', 'en-multi', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fil', 'fr', 'fy', 'ga', 'gd', 'gl', 'gu', 'ha', 'haw', 'hi', 'hi-Latn', 'hmn', 'ht', 'hu', 'hy', 'id', 'ig', 'is', 'it', 'iw', 'ja', 'ja-Latn', 'jv', 'ka', 'kk', 'km', 'kn', 'ko', 'ku', 'ky', 'la', 'lb', 'lo', 'lt', 'lv', 'mg', 'mi', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'ne', 'nl', 'no', 'ny', 'pa', 'pl', 'ps', 'pt', 'ro', 'ru', 'ru-Latn', 'sd', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sq', 'sr', 'st', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'tr', 'uk', 'und', 'ur', 'uz', 'vi', 'xh', 'yi', 'yo', 'zh', 'zh-Latn', 'zu']

I changed the code for the c4 data to the following:

traindata = load_dataset('allenai/c4', 'en', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
valdata = load_dataset('allenai/c4', 'en', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')

Then, I started getting the following error:

File "/simla/wanda/lib/data.py", line 48, in get_c4
traindata = load_dataset('allenai/c4', 'en', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
File "/home/.local/lib/python3.10/site-packages/datasets/load.py", line 2549, in load_dataset
builder_instance.download_and_prepare(
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
self._download_and_prepare(
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1118, in _download_and_prepare
verify_splits(self.info.splits, split_dict)
File "/home/.local/lib/python3.10/site-packages/datasets/utils/info_utils.py", line 92, in verify_splits
raise ExpectedMoreSplits(str(set(expected_splits) - set(recorded_splits)))
datasets.utils.info_utils.ExpectedMoreSplits: {'validation'}

I tried downloading with:

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "en/*"

After downloading the files, I needed to change the load_dataset call so that it reads the local files. So I did the following:

traindata = load_dataset('/simla/wanda/c4', 'en', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train', trust_remote_code=True)
valdata = load_dataset('/simla/wanda/c4', 'en', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation', trust_remote_code=True)

Now I am getting the following error:

Failed to read file '/simla/wanda/c4/en/c4-train.00000-of-01024.json.gz' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Invalid value. in row 0
Generating train split: 0%| | 0/364868892 [00:00<?, ? examples/s]
Traceback (most recent call last):
File "/home/.local/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 144, in _generate_tables
dataset = json.load(f)
File "/usr/lib/python3.10/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1973, in _prepare_split_single
for _, table in generator:
File "/home/.local/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 147, in _generate_tables
raise e
File "/home/.local/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 121, in _generate_tables
pa_table = paj.read_json(
File "pyarrow/_json.pyx", line 259, in pyarrow._json.read_json
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Invalid value. in row 0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/simla/wanda/main.py", line 110, in <module>
main()
File "/simla/wanda/main.py", line 69, in main
prune_wanda(args, model, tokenizer, device, prune_n=prune_n, prune_m=prune_m)
File "/simla/wanda/lib/prune.py", line 132, in prune_wanda
dataloader, _ = get_loaders("c4",nsamples=args.nsamples,seed=args.seed,seqlen=model.seqlen,tokenizer=tokenizer)
File "/simla/wanda/lib/data.py", line 80, in get_loaders
return get_c4(nsamples, seed, seqlen, tokenizer)
File "/simla/wanda/lib/data.py", line 50, in get_c4
traindata = load_dataset('/simla/wanda/c4', 'en', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train', trust_remote_code=True)
File "/home/.local/lib/python3.10/site-packages/datasets/load.py", line 2549, in load_dataset
builder_instance.download_and_prepare(
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
self._download_and_prepare(
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1860, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 2016, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
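
A note on the last error: "JSON parse error: Invalid value. in row 0" means the very first bytes of the file could not be parsed as JSON. One possible cause, which is only a guess and is not confirmed anywhere in this thread, is that after cloning with GIT_LFS_SKIP_SMUDGE=1 the shard on disk is still a small Git LFS pointer file rather than the real gzip archive. The sketch below checks the first bytes of a file to tell the two apart. The helper name classify_shard is hypothetical.

```python
# Minimal diagnostic sketch: is a local C4 shard a real gzip file or a Git LFS pointer?
# classify_shard is a hypothetical helper, not part of datasets or git-lfs.

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"  # first line of an LFS pointer


def classify_shard(path):
    """Return 'gzip', 'lfs-pointer', or 'unknown' based on the file's first bytes."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(LFS_POINTER_PREFIX):
        return "lfs-pointer"
    return "unknown"
```

Calling classify_shard('/simla/wanda/c4/en/c4-train.00000-of-01024.json.gz') and getting 'lfs-pointer' would suggest that git lfs pull did not actually fetch that shard. A result of 'gzip' would point to a different cause.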

Answer 1 · 2024-01-26T12:14:27.000Z

I finally found the solution: https://huggingface.co/datasets/allenai/c4/discussions/7

Answer 2 · 2024-08-06T12:33:31.000Z

This solution can be applied to the code by editing the following lines:

https://github.com/locuslab/wanda/blob/main/lib/data.py#L43-44
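
For readers who cannot open the links, here is a rough sketch of what the edited lines might look like. It assumes the fix is the one commonly reported for this error: drop the config name ('en', or 'allenai--c4' in older code) and select the shard through data_files only. I have not copied this from the linked discussion, so treat it as an assumption. The helper names c4_shard_kwargs and load_c4_shards are hypothetical.

```python
# Sketch of loading one train shard and one validation shard of C4 without a config name.
# Assumption: omitting the second positional argument ('en') is what resolves the errors above.


def c4_shard_kwargs(split, shard_file):
    """Build load_dataset keyword arguments for a single C4 shard."""
    return {
        "path": "allenai/c4",
        "data_files": {split: shard_file},
        "split": split,
    }


def load_c4_shards():
    """Load the two shards used in the question (needs network access and the datasets package)."""
    from datasets import load_dataset  # imported here so the helper above works without datasets

    traindata = load_dataset(**c4_shard_kwargs("train", "en/c4-train.00000-of-01024.json.gz"))
    valdata = load_dataset(**c4_shard_kwargs("validation", "en/c4-validation.00000-of-00008.json.gz"))
    return traindata, valdata
```

In lib/data.py this would amount to replacing the two load_dataset calls in get_c4 with calls that pass no config name. The shard file names are the ones from the question.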