yuchenlin/LLM-Blender

Issue with downloading dataset from HuggingFace

swarnaHub opened this issue · 3 comments

Thanks for releasing the dataset! I am trying to download it from HuggingFace using

from datasets import load_dataset
dataset = load_dataset("llm-blender/mix-instruct")

But this gives me the following error:

id: string
instruction: string
to
{'id': Value(dtype='string', id=None), 'instruction': Value(dtype='string', id=None), 'input': Value(dtype='string', id=None), 'output': Value(dtype='string', id=None), 'candidates': [{'decoding_method': Value(dtype='string', id=None), 'model': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'scores': {'logprobs': Value(dtype='float64', id=None), 'rougeL': Value(dtype='float64', id=None), 'rouge2': Value(dtype='float64', id=None), 'rougeLsum': Value(dtype='float64', id=None), 'rouge1': Value(dtype='float64', id=None), 'bleu': Value(dtype='float64', id=None), 'bertscore': Value(dtype='float64', id=None), 'bleurt': Value(dtype='float64', id=None), 'bartscore': Value(dtype='float64', id=None)}}]}
because column names don't match
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/home/swarnadeep/miniconda/envs/multi/lib/python3.10/site-packages/datasets/load.py", line 1797, in load_dataset
builder_instance.download_and_prepare(
File "/data/home/swarnadeep/miniconda/envs/multi/lib/python3.10/site-packages/datasets/builder.py", line 890, in download_and_prepare
self._download_and_prepare(
File "/data/home/swarnadeep/miniconda/envs/multi/lib/python3.10/site-packages/datasets/builder.py", line 985, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/data/home/swarnadeep/miniconda/envs/multi/lib/python3.10/site-packages/datasets/builder.py", line 1746, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/data/home/swarnadeep/miniconda/envs/multi/lib/python3.10/site-packages/datasets/builder.py", line 1891, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

Thank you for reporting the issue. @jdf-prog and I will look into this and get back to you ASAP.

@swarnaHub Thank you for pointing out the issue. It turns out the problem was an additional field, cmp_results, that existed only in the test split and confused the Hugging Face dataset loader. I have fixed this format error by adding a cmp_results field to all the files, so the data can now be downloaded from Hugging Face with the same code as above.
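For anyone who hits a similar "column names don't match" error with their own data, here is a rough sketch of how the mismatch could be spotted before uploading. It assumes the splits are stored as JSON Lines files, which may not match how this dataset is actually laid out. The helper names `top_level_keys` and `find_schema_mismatches` are made up for illustration, and this is not the fix that was applied to the repository.

```python
import json


def top_level_keys(jsonl_path):
    """Return the set of top-level keys seen in any record of a JSON Lines file."""
    keys = set()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                keys.update(json.loads(line).keys())
    return keys


def find_schema_mismatches(split_files):
    """Map each split name to the fields it lacks relative to the union of all splits.

    `split_files` is a dict such as {"train": "train.jsonl", "test": "test.jsonl"}
    (hypothetical file names). Splits that have every field are omitted.
    """
    keys_by_split = {name: top_level_keys(path) for name, path in split_files.items()}
    all_keys = set().union(*keys_by_split.values())
    return {
        name: sorted(all_keys - keys)
        for name, keys in keys_by_split.items()
        if all_keys - keys
    }
```

If only the test file carried cmp_results, calling `find_schema_mismatches` on the train and test files would report that field as missing from train. That is the kind of inconsistency described above.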

Thank you!