huggingface/datasets

Problems after upgrading to 2.6.1


Describe the bug

Loading a dataset_dict from disk with load_from_disk now raises KeyError: "length"; this did not occur in v2.5.2.

Context:

  • Each individual dataset in the dict is created with Dataset.from_pandas
  • The dataset_dict is created from a dict of Datasets, e.g., `DatasetDict({"train": train_ds, "validation": val_ds})`
  • Besides text columns, the pandas dataframe has a column containing a dictionary, with potentially different keys in each row. Dataset.from_pandas correctly adds the missing keys (as key: None) to the dictionary in each row so that the schema can be inferred (see the sketch below).
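As a rough sketch of that setup (the column names and contents here are illustrative placeholders, not taken from the original report):

import pandas as pd
from datasets import Dataset, DatasetDict

# A dict column whose rows do not all share the same keys.
df = pd.DataFrame({
    "text": ["first example", "second example"],
    "meta": [{"a": 1}, {"b": 2}],
})

# from_pandas infers a schema for "meta"; per the report, keys missing
# from a given row end up as None.
train_ds = Dataset.from_pandas(df)
val_ds = Dataset.from_pandas(df)
dataset_dict = DatasetDict({"train": train_ds, "validation": val_ds})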

Steps to reproduce the bug

Steps to reproduce:

  • Upgrade to datasets==2.6.1
  • Create a dataset from a pandas dataframe with Dataset.from_pandas
  • Create a dataset_dict from a dict of Datasets, e.g., `DatasetDict({"train": train_ds, "validation": val_ds})`
  • Save to disk with save_to_disk (see the sketch below)
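Continuing the sketch above (the save path is a placeholder), the reported failure is that reloading the saved DatasetDict raises the error:

dataset_dict.save_to_disk("path/to/dataset_dict")

from datasets import load_from_disk
reloaded = load_from_disk("path/to/dataset_dict")  # raises KeyError: 'length' per the report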

Expected behavior

Same as in v2.5.2, i.e., the dataset loads from disk without errors.

Environment info

  • datasets version: 2.6.1
  • Platform: Linux-5.4.209-129.367.amzn2int.x86_64-x86_64-with-glibc2.26
  • Python version: 3.9.13
  • PyArrow version: 9.0.0
  • Pandas version: 1.5.1

Hi! I can't reproduce the error following these steps. Can you please provide a reproducible example?

I faced the same issue:

Repro

!pip install datasets==2.6.1
from datasets import Dataset
dataset = Dataset.from_pandas(dataframe)  # dataframe: a pandas DataFrame
dataset.save_to_disk(local)               # local: path to a local directory

!pip install datasets==2.5.2
from datasets import Dataset
dataset = Dataset.load_from_disk(local)

@Lokiiiiii And what are the contents of the "dataframe" in your example?

I bumped into the issue too. @Lokiiiiii thanks for the steps. I "solved" it for now by pinning datasets>=2.6.1 everywhere.

Hi all,
I experienced the same issue.
Please note that the pull request only concerns the IMDB example provided in the docs: it fixes that example so that people can follow it and end up with a working setup.
It does not provide a fix for datasets itself.

I'm getting the same error:

  • using the base AWS HF container, which uses datasets < 2
  • after updating the AWS HF container to use datasets 2.4

Same here, running on our SageMaker pipelines. It's only happening for some but not all of our saved Datasets.

I am also receiving this error on SageMaker but not locally. I have noticed that it occurs when the .dataset/ folder does not contain a single file like:

dataset.arrow

but instead contains multiple files like:

data-00000-of-00002.arrow
data-00001-of-00002.arrow

I think it may have something to do with this recent PR, which changed the behaviour of dataset.save_to_disk by introducing sharding: #5268
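For reference, a minimal way to check whether a saved dataset folder is sharded is to read its state.json (the path is a placeholder; the _data_files field is the one shown in the state.json example further down this thread):

import json
import os

dataset_dir = "path/to/dataset"  # placeholder: folder produced by save_to_disk

with open(os.path.join(dataset_dir, "state.json")) as f:
    state = json.load(f)

# One entry (e.g. dataset.arrow) means a single shard; several entries like
# data-00000-of-00002.arrow indicate the sharded layout described above.
print([entry["filename"] for entry in state["_data_files"]])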

For now I can get around this by forcing datasets==2.8.0 both on the machine that creates the dataset and in the Hugging Face training instance (by running os.system("pip install datasets==2.8.0") at the start of the training script).

Another workaround is to ensure the dataset is a single shard when saving it locally:

dataset.flatten_indices().save_to_disk('path/to/dataset', num_shards=1)

and then to manually rename the file from path/to/dataset/data-00000-of-00001.arrow to path/to/dataset/dataset.arrow and update path/to/dataset/state.json to reflect this name change, i.e. by changing state.json to this:

{
  "_data_files": [
    {
      "filename": "dataset.arrow"
    }
  ],
  "_fingerprint": "420086f0636f8727",
  "_format_columns": null,
  "_format_kwargs": {},
  "_format_type": null,
  "_output_all_columns": false,
  "_split": null
}
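A minimal sketch of those manual steps (paths and file names as above; it just renames the single shard and rewrites state.json accordingly):

import json
import os

dataset_dir = "path/to/dataset"  # the save_to_disk target used above

# Rename the single shard produced with num_shards=1.
os.rename(
    os.path.join(dataset_dir, "data-00000-of-00001.arrow"),
    os.path.join(dataset_dir, "dataset.arrow"),
)

# Point state.json at the renamed file.
state_path = os.path.join(dataset_dir, "state.json")
with open(state_path) as f:
    state = json.load(f)
state["_data_files"] = [{"filename": "dataset.arrow"}]
with open(state_path, "w") as f:
    json.dump(state, f, indent=2)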

Does anyone know if this has been resolved?

I have the same issue in datasets version 2.3.2