Issues with using the released hh dataset.
jltchiu opened this issue · 2 comments
Hi, I am trying to use your published dataset on Hugging Face. I am loading it as follows:
import os, sys, json
from datasets import load_dataset
dataset = load_dataset("fnlp/hh-rlhf-strength-cleaned")
print(dataset)
However, I get the error below:
Downloading data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 11008.67it/s]
Extracting data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2365.65it/s]
Generating train split: 151214 examples [00:05, 25560.14 examples/s]
Generating validation split: 0 examples [00:00, ? examples/s]
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/datasets/builder.py", line 1940, in _prepare_split_single
writer.write_table(table)
File "/root/miniconda3/lib/python3.9/site-packages/datasets/arrow_writer.py", line 572, in write_table
pa_table = table_cast(pa_table, self._schema)
File "/root/miniconda3/lib/python3.9/site-packages/datasets/table.py", line 2328, in table_cast
return cast_table_to_schema(table, schema)
File "/root/miniconda3/lib/python3.9/site-packages/datasets/table.py", line 2286, in cast_table_to_schema
raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
ValueError: Couldn't cast
std preference difference: double
rejected score list: list<item: double>
child 0, item: double
rejected: list<item: string>
child 0, item: string
chosen score list: list<item: double>
child 0, item: double
chosen: list<item: string>
child 0, item: string
mean preference difference: double
GPT4 label: int64
to
{'std preference difference': Value(dtype='float64', id=None), 'rejected score list': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None), 'rejected': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'chosen score list': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None), 'chosen': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'mean preference difference': Value(dtype='float64', id=None)}
because column names don't match
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/chiujustin01-pvc/workspace/work/get_data.py", line 5, in <module>
dataset = load_dataset("fnlp/hh-rlhf-strength-cleaned")
File "/root/miniconda3/lib/python3.9/site-packages/datasets/load.py", line 2153, in load_dataset
builder_instance.download_and_prepare(
File "/root/miniconda3/lib/python3.9/site-packages/datasets/builder.py", line 954, in download_and_prepare
self._download_and_prepare(
File "/root/miniconda3/lib/python3.9/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/datasets/builder.py", line 1813, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/root/miniconda3/lib/python3.9/site-packages/datasets/builder.py", line 1958, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
I have used the same code for other datasets and it seems to work. Do you know where I should fix my code?
I ran into the same issue. Have you solved it?
After investigation, we found that a schema inconsistency between the 'train' and 'valid' splits prevented the datasets library from loading both splits together. We fixed this by adding an empty field to the 'train' split ('GPT4 label', set to -1 to indicate the field is empty), which resolved the problem. Your code should now download the dataset successfully.
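For anyone who hit this error before the updated files were pushed, here is a minimal sketch of the same schema alignment applied locally. It assumes the raw splits are available as JSON Lines files; the paths train.jsonl and valid.jsonl are hypothetical placeholders, not the actual file names in the repository.
from datasets import load_dataset, Value, DatasetDict

# Load each raw split on its own (hypothetical local file paths).
train = load_dataset("json", data_files="train.jsonl", split="train")
valid = load_dataset("json", data_files="valid.jsonl", split="train")

# Mirror the published fix: add an empty 'GPT4 label' column (-1 means "empty")
# to the train split so its schema matches the validation split.
if "GPT4 label" not in train.column_names:
    train = train.add_column("GPT4 label", [-1] * len(train))
    train = train.cast_column("GPT4 label", Value("int64"))

dataset = DatasetDict({"train": train, "validation": valid})
print(dataset)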