AMAAI-Lab/MidiCaps

Hugging Face Dataset seems to be corrupted :(

asigalov61 opened this issue · 4 comments

Hey @dorienh @elchico1990 @Dapwner @ismirsubmission198

I wanted to try MidiCaps today, but the dataset's JSON files seem to be corrupted. Here is the code and the traceback:

from datasets import load_dataset

mc_dataset = load_dataset("amaai-lab/MidiCaps")

Generating train split: 168385/0 [00:02<00:00, 195311.74 examples/s]
Failed to load JSON from file '/root/.cache/huggingface/datasets/downloads/3307e000d26ff30aa307ac62029c7e215e9691a75780dc2714afd1493a96e2f9' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Column(/genre/[]/[]) changed from string to number in row 0
ERROR:datasets.packaged_modules.json.json:Failed to load JSON from file '/root/.cache/huggingface/datasets/downloads/3307e000d26ff30aa307ac62029c7e215e9691a75780dc2714afd1493a96e2f9' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Column(/genre/[]/[]) changed from string to number in row 0
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/json/json.py in _generate_tables(self, files)
    152                                 ) as f:
--> 153                                     df = pd.read_json(f, dtype_backend="pyarrow")
    154                             except ValueError:

16 frames
/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, precise_float, date_unit, encoding, encoding_errors, lines, chunksize, compression, nrows, storage_options, dtype_backend, engine)
    783     else:
--> 784         return json_reader.read()
    785 

/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py in read(self)
    974                 else:
--> 975                     obj = self._get_object_parser(self.data)
    976                 if self.dtype_backend is not lib.no_default:

/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py in _get_object_parser(self, json)
   1000         if typ == "frame":
-> 1001             obj = FrameParser(json, **kwargs).parse()
   1002 

/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py in parse(self)
   1133     def parse(self):
-> 1134         self._parse()
   1135 

/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py in _parse(self)
   1319             self.obj = DataFrame(
-> 1320                 loads(json, precise_float=self.precise_float), dtype=None
   1321             )

ValueError: Trailing data

During handling of the above exception, another exception occurred:

ArrowInvalid                              Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1996                 _time = time.time()
-> 1997                 for _, table in generator:
   1998                     if max_shard_size is not None and writer._num_bytes > max_shard_size:

/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/json/json.py in _generate_tables(self, files)
    155                                 logger.error(f"Failed to load JSON from file '{file}' with error {type(e)}: {e}")
--> 156                                 raise e
    157                             if df.columns.tolist() == [0]:

/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/json/json.py in _generate_tables(self, files)
    129                                 try:
--> 130                                     pa_table = paj.read_json(
    131                                         io.BytesIO(batch), read_options=paj.ReadOptions(block_size=block_size)

/usr/local/lib/python3.10/dist-packages/pyarrow/_json.pyx in pyarrow._json.read_json()

/usr/local/lib/python3.10/dist-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/usr/local/lib/python3.10/dist-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: JSON parse error: Column(/genre/[]/[]) changed from string to number in row 0

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
<ipython-input-12-661da53cceac> in <cell line: 1>()
----> 1 mc_dataset = load_dataset("amaai-lab/MidiCaps")

/usr/local/lib/python3.10/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2614 
   2615     # Download and prepare data
-> 2616     builder_instance.download_and_prepare(
   2617         download_config=download_config,
   2618         download_mode=download_mode,

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
   1027                         if num_proc is not None:
   1028                             prepare_split_kwargs["num_proc"] = num_proc
-> 1029                         self._download_and_prepare(
   1030                             dl_manager=dl_manager,
   1031                             verification_mode=verification_mode,

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
   1122             try:
   1123                 # Prepare split will record examples associated to the split
-> 1124                 self._prepare_split(split_generator, **prepare_split_kwargs)
   1125             except OSError as e:
   1126                 raise OSError(

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split(self, split_generator, file_format, num_proc, max_shard_size)
   1882             job_id = 0
   1883             with pbar:
-> 1884                 for job_id, done, content in self._prepare_split_single(
   1885                     gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
   1886                 ):

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   2038             if isinstance(e, DatasetGenerationError):
   2039                 raise
-> 2040             raise DatasetGenerationError("An error occurred while generating the dataset") from e
   2041 
   2042         yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset
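
For reference, the ArrowInvalid message points at a type inconsistency in the `genre` field (some entries are strings, some numbers), which breaks pyarrow's schema inference. Here is a minimal sketch to check the raw file directly, assuming it is JSON Lines; the cache path is the one from my Colab run above, so adjust it for your machine:

import json

# Cache path printed in the traceback above; adjust for your setup.
path = "/root/.cache/huggingface/datasets/downloads/3307e000d26ff30aa307ac62029c7e215e9691a75780dc2714afd1493a96e2f9"

# Collect the Python types appearing inside `genre` entries, since
# pyarrow reports the column flipping from string to number.
types_seen = set()
with open(path) as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        for entry in record.get("genre", []):
            # /genre/[]/[] in the error hints at nested lists, so flatten one level.
            items = entry if isinstance(entry, list) else [entry]
            for item in items:
                types_seen.add(type(item).__name__)
        if i >= 1000:  # a sample is enough to spot a mix
            break

print(types_seen)  # e.g. {'str', 'int'} would confirm the mismatch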

I would really appreciate it if you could fix this error soon :)

Sincerely,

Alex

Hi @asigalov61, thanks for raising this issue. We are looking into the error. If you wish to look at the dataset ASAP, please download it manually. Thanks!
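
In case it helps, one way to grab the raw repo files without going through the dataset builder is `snapshot_download` from `huggingface_hub` (a sketch, using the repo id from above):

from huggingface_hub import snapshot_download

# Download the raw dataset files (including the .json) from the Hub,
# bypassing the datasets builder that trips on the schema mismatch.
local_dir = snapshot_download(repo_id="amaai-lab/MidiCaps", repo_type="dataset")
print(local_dir)  # local cache folder containing the downloaded files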

@elchico1990 You are welcome and I appreciate your fast response.

Where can I manually download the dataset? I tried downloading the JSON files from Hugging Face directly, but they also seem to be corrupted. Is there an alternative download link?

Thank you,

Alex.

Hi, we updated the .json file; the dataset should now be downloadable through load_dataset("amaai-lab/MidiCaps").

In that process, we merged all three versions of our .json files into a single one. Give it a try!
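
For a quick sanity check that the fix took, something like this should now complete cleanly (a sketch; the single train split matches the progress bar shown above, and the exact column names come from the repo):

from datasets import load_dataset

# Reload now that the merged .json is up on the Hub.
mc_dataset = load_dataset("amaai-lab/MidiCaps")
print(mc_dataset)              # splits and example counts
print(mc_dataset["train"][0])  # one record, e.g. its caption and genre fields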

Closing the issue.

@Dapwner Yes, thank you for your support! :) Everything seems to work fine now :)

I will try it on a sentence transformer implementation and let you know the results :)
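
Roughly what I have in mind, in case it's useful to anyone (a sketch; the model choice and the `caption` field name are my assumptions):

from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# Embed a handful of MidiCaps captions for similarity experiments.
dataset = load_dataset("amaai-lab/MidiCaps")["train"]
model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
captions = [row["caption"] for row in dataset.select(range(8))]
embeddings = model.encode(captions)
print(embeddings.shape)  # (8, 384) for this model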

Thanks again

Alex