Hugging Face Dataset seems to be corrupted :(
asigalov61 opened this issue · 4 comments
Hey @dorienh @elchico1990 @Dapwner @ismirsubmission198
I wanted to try MidiCaps today, but it seems that the dataset's JSON files are corrupted. Here is the code and the traceback:
from datasets import load_dataset

mc_dataset = load_dataset("amaai-lab/MidiCaps")
Generating train split:
168385/0 [00:02<00:00, 195311.74 examples/s]
Failed to load JSON from file '/root/.cache/huggingface/datasets/downloads/3307e000d26ff30aa307ac62029c7e215e9691a75780dc2714afd1493a96e2f9' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Column(/genre/[]/[]) changed from string to number in row 0
ERROR:datasets.packaged_modules.json.json:Failed to load JSON from file '/root/.cache/huggingface/datasets/downloads/3307e000d26ff30aa307ac62029c7e215e9691a75780dc2714afd1493a96e2f9' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Column(/genre/[]/[]) changed from string to number in row 0
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/json/json.py in _generate_tables(self, files)
152 ) as f:
--> 153 df = pd.read_json(f, dtype_backend="pyarrow")
154 except ValueError:
/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, precise_float, date_unit, encoding, encoding_errors, lines, chunksize, compression, nrows, storage_options, dtype_backend, engine)
783 else:
--> 784 return json_reader.read()
785
/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py in read(self)
974 else:
--> 975 obj = self._get_object_parser(self.data)
976 if self.dtype_backend is not lib.no_default:
/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py in _get_object_parser(self, json)
1000 if typ == "frame":
-> 1001 obj = FrameParser(json, **kwargs).parse()
1002
/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py in parse(self)
1133 def parse(self):
-> 1134 self._parse()
1135
/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py in _parse(self)
1319 self.obj = DataFrame(
-> 1320 loads(json, precise_float=self.precise_float), dtype=None
1321 )
ValueError: Trailing data
During handling of the above exception, another exception occurred:
ArrowInvalid Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
1996 _time = time.time()
-> 1997 for _, table in generator:
1998 if max_shard_size is not None and writer._num_bytes > max_shard_size:
/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/json/json.py in _generate_tables(self, files)
155 logger.error(f"Failed to load JSON from file '{file}' with error {type(e)}: {e}")
--> 156 raise e
157 if df.columns.tolist() == [0]:
/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/json/json.py in _generate_tables(self, files)
129 try:
--> 130 pa_table = paj.read_json(
131 io.BytesIO(batch), read_options=paj.ReadOptions(block_size=block_size)
/usr/local/lib/python3.10/dist-packages/pyarrow/_json.pyx in pyarrow._json.read_json()
/usr/local/lib/python3.10/dist-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
/usr/local/lib/python3.10/dist-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: JSON parse error: Column(/genre/[]/[]) changed from string to number in row 0
The above exception was the direct cause of the following exception:
DatasetGenerationError Traceback (most recent call last)
<ipython-input-12-661da53cceac> in <cell line: 1>()
----> 1 mc_dataset = load_dataset("amaai-lab/MidiCaps")
/usr/local/lib/python3.10/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
2614
2615 # Download and prepare data
-> 2616 builder_instance.download_and_prepare(
2617 download_config=download_config,
2618 download_mode=download_mode,
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
1027 if num_proc is not None:
1028 prepare_split_kwargs["num_proc"] = num_proc
-> 1029 self._download_and_prepare(
1030 dl_manager=dl_manager,
1031 verification_mode=verification_mode,
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
1122 try:
1123 # Prepare split will record examples associated to the split
-> 1124 self._prepare_split(split_generator, **prepare_split_kwargs)
1125 except OSError as e:
1126 raise OSError(
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split(self, split_generator, file_format, num_proc, max_shard_size)
1882 job_id = 0
1883 with pbar:
-> 1884 for job_id, done, content in self._prepare_split_single(
1885 gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
1886 ):
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
2038 if isinstance(e, DatasetGenerationError):
2039 raise
-> 2040 raise DatasetGenerationError("An error occurred while generating the dataset") from e
2041
2042 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)
DatasetGenerationError: An error occurred while generating the dataset
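The root cause appears to be the line `Column(/genre/[]/[]) changed from string to number`: pyarrow infers a column's type from the first rows it sees, and a record whose nested `genre` list mixes numbers in with strings breaks that inference. As a local workaround, one could pre-normalize the JSON-lines file by coercing every genre entry to a string before loading. This is a minimal sketch under the assumption that the file is JSON-lines and that `genre` is a list of lists (as the error path `/genre/[]/[]` suggests):

```python
import json

def normalize_genres(in_path: str, out_path: str) -> None:
    """Coerce every entry in the nested `genre` lists to a string so that
    pyarrow's type inference no longer sees a string/number mix."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if "genre" in record:
                # Error path /genre/[]/[] suggests a list of lists.
                record["genre"] = [[str(g) for g in inner] for inner in record["genre"]]
            fout.write(json.dumps(record) + "\n")
```

The cleaned file should then load via `load_dataset("json", data_files=out_path)` without tripping the type-inference error.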
I would really appreciate it if you could fix this error soon :)
Sincerely,
Alex
Hi @asigalov61, thanks for raising this issue. We are looking into the error. If you want to work with the dataset in the meantime, please download the files manually. Thanks
@elchico1990 You are welcome and I appreciate your fast response.
Where can I manually download the dataset? I tried downloading the JSON files from Hugging Face, but they also seem to be corrupted. Is there an alternative download link?
Thank you,
Alex.
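For manual downloads, files in a Hugging Face dataset repo can be fetched directly through the hub's `/resolve/` URL scheme. A small sketch of building such a URL (the filename `train.json` here is hypothetical, not confirmed by the repo):

```python
def hf_dataset_file_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Direct-download URL for a file in a Hugging Face dataset repo,
    following the hub's /resolve/ URL scheme."""
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{filename}"
```

Alternatively, `huggingface_hub.hf_hub_download(repo_id="amaai-lab/MidiCaps", repo_type="dataset", filename=...)` handles the same download with caching.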
Hi, we updated the .json file; the dataset should now be downloadable through load_dataset("amaai-lab/MidiCaps").
In the process, we merged all three versions of our .json files into a single one. Give it a try!
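A merge like the one described above can be done by concatenating the JSON-lines files while validating each row; a minimal sketch (file paths are placeholders, not the repo's actual filenames):

```python
import json

def merge_jsonl(paths, out_path):
    """Concatenate several JSON-lines files into one, parsing each
    line first so malformed rows fail early instead of corrupting
    the merged file."""
    n = 0
    with open(out_path, "w") as fout:
        for path in paths:
            with open(path) as fin:
                for line in fin:
                    line = line.strip()
                    if line:
                        json.loads(line)  # raises on malformed JSON
                        fout.write(line + "\n")
                        n += 1
    return n
```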
Closing the issue.
@Dapwner Yes, thank you for your support! :) Everything seems to work fine now :)
I will try it on a sentence transformer implementation and let you know the results :)
Thanks again
Alex