Problems after upgrading to 2.6.1
Describe the bug
Loading a DatasetDict from disk with load_from_disk now raises a KeyError: "length" that did not occur in v2.5.2.
Context:
- Each individual dataset in the dict is created with Dataset.from_pandas.
- The DatasetDict is created from a dict of Datasets, e.g., `DatasetDict({"train": train_ds, "validation": val_ds})`.
- The pandas dataframe, besides text columns, has a column containing a dictionary whose keys can differ from row to row. Dataset.from_pandas correctly adds `key: None` to the dictionaries in each row so that the schema can be inferred (see the sketch below).
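A minimal sketch of that last point, assuming a toy dataframe (the column names and values are made up for illustration, not taken from the original report):

import pandas as pd
from datasets import Dataset

# Toy dataframe: the "meta" column holds dicts with different keys per row.
df = pd.DataFrame({
    "text": ["first example", "second example"],
    "meta": [{"a": 1}, {"b": 2}],
})

ds = Dataset.from_pandas(df)
# Missing keys are filled with None so a single struct schema can be inferred,
# e.g. {"a": 1, "b": None} and {"a": None, "b": 2}.
print(ds.features)
print(ds[0]["meta"], ds[1]["meta"])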
Steps to reproduce the bug
Steps to reproduce:
- Upgrade to datasets==2.6.1
- Create a dataset from a pandas dataframe with Dataset.from_pandas
- Create a DatasetDict from a dict of Datasets, e.g., `DatasetDict({"train": train_ds, "validation": val_ds})`
- Save it to disk with save_to_disk
- Load it back with load_from_disk (see the sketch below)
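A self-contained sketch of those steps, assuming a toy dataframe like the one above (the path "my_dataset_dict" and the column names are placeholders, not from the original report):

import pandas as pd
from datasets import Dataset, DatasetDict, load_from_disk

df = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "meta": [{"a": 1}, {"b": 2}, {"a": 3}, {"b": 4}],
})

train_ds = Dataset.from_pandas(df.iloc[:2], preserve_index=False)
val_ds = Dataset.from_pandas(df.iloc[2:], preserve_index=False)
dsd = DatasetDict({"train": train_ds, "validation": val_ds})

dsd.save_to_disk("my_dataset_dict")           # saving works
reloaded = load_from_disk("my_dataset_dict")  # per the report, raises KeyError: "length" on 2.6.1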
Expected behavior
Same as in v2.5.2, that is, loading from disk without errors.
Environment info
- datasets version: 2.6.1
- Platform: Linux-5.4.209-129.367.amzn2int.x86_64-x86_64-with-glibc2.26
- Python version: 3.9.13
- PyArrow version: 9.0.0
- Pandas version: 1.5.1
Hi! I can't reproduce the error following these steps. Can you please provide a reproducible example?
I faced the same issue:
Repro
!pip install datasets==2.6.1
from datasets import Dataset
dataset = Dataset.from_pandas(dataframe)
dataset.save_to_disk(local)
!pip install datasets==2.5.2
from datasets import Dataset
dataset = Dataset.load_from_disk(local)
@Lokiiiiii And what are the contents of the "dataframe" in your example?
I bumped into the issue too. @Lokiiiiii thanks for the steps. I "solved" it for now by pinning pip install datasets>=2.6.1 everywhere.
Hi all,
I experienced the same issue.
Please note that the pull request only fixes the IMDB example provided in the doc, so that people can follow that example and have a working system. It does not provide a fix for Datasets itself.
I'm getting the same error:
- using the base AWS HF container, which ships datasets < 2
- updating the AWS HF container to use datasets 2.4
Same here, running on our SageMaker pipelines. It's only happening for some but not all of our saved Datasets.
I am also receiving this error on SageMaker but not locally. I have noticed that it occurs when the .dataset/
folder does not contain a single file like:
dataset.arrow
but instead contains multiple files like:
data-00000-of-00002.arrow
data-00001-of-00002.arrow
I think it may have something to do with this recent PR that updated the behaviour of dataset.save_to_disk by introducing sharding: #5268
For now I can get around this by forcing datasets==2.8.0 on the machine that creates the dataset and in the Hugging Face instance used for training (by running os.system("pip install datasets==2.8.0") at the start of the training script).
To ensure the dataset is a single shard when saving it locally:
dataset.flatten_indices().save_to_disk('path/to/dataset', num_shards=1)
Then manually rename path/to/dataset/data-00000-of-00001.arrow
to path/to/dataset/dataset.arrow
and update path/to/dataset/state.json
to reflect the name change, i.e. by changing state.json
to this:
{
"_data_files": [
{
"filename": "dataset.arrow"
}
],
"_fingerprint": "420086f0636f8727",
"_format_columns": null,
"_format_kwargs": {},
"_format_type": null,
"_output_all_columns": false,
"_split": null
}
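A small sketch that automates that rename and the state.json edit (just a convenience script for the manual steps above, not part of the datasets API; "path/to/dataset" is a placeholder):

import json
import os

dataset_dir = "path/to/dataset"

# Rename the single shard produced by save_to_disk(..., num_shards=1).
os.rename(
    os.path.join(dataset_dir, "data-00000-of-00001.arrow"),
    os.path.join(dataset_dir, "dataset.arrow"),
)

# Point state.json at the renamed file.
state_path = os.path.join(dataset_dir, "state.json")
with open(state_path) as f:
    state = json.load(f)
state["_data_files"] = [{"filename": "dataset.arrow"}]
with open(state_path, "w") as f:
    json.dump(state, f, indent=2)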
Does anyone know if this has been resolved?
I have the same issue in datasets version 2.3.2