huggingface/datasets

Problems after upgrading to 2.6.1


Describe the bug

Loading a dataset_dict from disk with load_from_disk now raises KeyError: "length"; this did not occur in v2.5.2.

Context:

  • Each individual dataset in the dict is created with Dataset.from_pandas
  • The dataset_dict is created from a dict of Datasets, e.g., `DatasetDict({"train": train_ds, "validation": val_ds})`
  • Besides text columns, the pandas dataframe has a column containing a dictionary, with potentially different keys in each row. Dataset.from_pandas correctly adds the missing keys (as key: None) to the dictionary in each row so that the schema can be inferred (see the sketch below).
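As a rough sketch of that setup (the column names and contents here are illustrative placeholders, not taken from the original report):

import pandas as pd
from datasets import Dataset, DatasetDict

# A dict column whose rows do not all share the same keys.
df = pd.DataFrame({
    "text": ["first example", "second example"],
    "meta": [{"a": 1}, {"b": 2}],
})

# from_pandas infers a schema for "meta"; per the report, keys missing
# from a given row end up as None.
train_ds = Dataset.from_pandas(df)
val_ds = Dataset.from_pandas(df)
dataset_dict = DatasetDict({"train": train_ds, "validation": val_ds})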

Steps to reproduce the bug

Steps to reproduce:

  • Upgrade to datasets==2.6.1
  • Create a dataset from a pandas dataframe with Dataset.from_pandas
  • Create a dataset_dict from a dict of Datasets, e.g., `DatasetDict({"train": train_ds, "validation": val_ds})`
  • Save to disk with save_to_disk (see the sketch below)
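Continuing the sketch above (the save path is a placeholder), the reported failure is that reloading the saved DatasetDict raises the error:

dataset_dict.save_to_disk("path/to/dataset_dict")

from datasets import load_from_disk
reloaded = load_from_disk("path/to/dataset_dict")  # raises KeyError: 'length' per the report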

Expected behavior

Same as in v2.5.2, i.e., the dataset loads from disk without errors.

Environment info

  • datasets version: 2.6.1
  • Platform: Linux-5.4.209-129.367.amzn2int.x86_64-x86_64-with-glibc2.26
  • Python version: 3.9.13
  • PyArrow version: 9.0.0
  • Pandas version: 1.5.1

Hi! I can't reproduce the error following these steps. Can you please provide a reproducible example?

I faced the same issue:

Repro

!pip install datasets==2.6.1
from datasets import Dataset
dataset = Dataset.from_pandas(dataframe)  # dataframe: a pandas DataFrame
dataset.save_to_disk(local)               # local: path to a local directory

!pip install datasets==2.5.2
from datasets import Dataset
dataset = Dataset.load_from_disk(local)

@Lokiiiiii And what are the contents of the "dataframe" in your example?

I bumped into the issue too. @Lokiiiiii thanks for the steps. I "solved" it for now by pinning datasets>=2.6.1 everywhere.

Hi all,
I experienced the same issue.
Please note that the pull request only concerns the IMDB example provided in the docs: it fixes that example so that people can follow it and end up with a working setup.
It does not provide a fix for datasets itself.

I'm getting the same error:

  • using the base AWS HF container, which uses datasets < 2
  • after updating the AWS HF container to use datasets 2.4

Same here, running on our SageMaker pipelines. It's only happening for some but not all of our saved Datasets.

I am also receiving this error on SageMaker but not locally. I have noticed that it occurs when the .dataset/ folder does not contain a single file like:

dataset.arrow

but instead contains multiple files like:

data-00000-of-00002.arrow
data-00001-of-00002.arrow

I think it may have something to do with this recent PR, which changed the behaviour of dataset.save_to_disk by introducing sharding: #5268
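For reference, a minimal way to check whether a saved dataset folder is sharded is to read its state.json (the path is a placeholder; the _data_files field is the one shown in the state.json example further down this thread):

import json
import os

dataset_dir = "path/to/dataset"  # placeholder: folder produced by save_to_disk

with open(os.path.join(dataset_dir, "state.json")) as f:
    state = json.load(f)

# One entry (e.g. dataset.arrow) means a single shard; several entries like
# data-00000-of-00002.arrow indicate the sharded layout described above.
print([entry["filename"] for entry in state["_data_files"]])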

For now I can get around this by forcing datasets==2.8.0 both on the machine that creates the dataset and in the Hugging Face training instance (by running os.system("pip install datasets==2.8.0") at the start of the training script).

Another workaround is to ensure the dataset is a single shard when saving it locally:

dataset.flatten_indices().save_to_disk('path/to/dataset', num_shards=1)

and then to manually rename the file from path/to/dataset/data-00000-of-00001.arrow to path/to/dataset/dataset.arrow and update path/to/dataset/state.json to reflect this name change, i.e. by changing state.json to this:

{
  "_data_files": [
    {
      "filename": "dataset.arrow"
    }
  ],
  "_fingerprint": "420086f0636f8727",
  "_format_columns": null,
  "_format_kwargs": {},
  "_format_type": null,
  "_output_all_columns": false,
  "_split": null
}
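A minimal sketch of those manual steps (paths and file names as above; it just renames the single shard and rewrites state.json accordingly):

import json
import os

dataset_dir = "path/to/dataset"  # the save_to_disk target used above

# Rename the single shard produced with num_shards=1.
os.rename(
    os.path.join(dataset_dir, "data-00000-of-00001.arrow"),
    os.path.join(dataset_dir, "dataset.arrow"),
)

# Point state.json at the renamed file.
state_path = os.path.join(dataset_dir, "state.json")
with open(state_path) as f:
    state = json.load(f)
state["_data_files"] = [{"filename": "dataset.arrow"}]
with open(state_path, "w") as f:
    json.dump(state, f, indent=2)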

Does anyone know if this has been resolved?

I have the same issue in datasets version 2.3.2