huggingface/datasets

Fallback to arrow defaults when loading dataset with custom features that aren't registered locally

alex-hh opened this issue · 0 comments

Describe the bug

Datasets allows users to create and register custom features.

However, if such a dataset is then pushed to the Hub, anyone who calls load_dataset without first registering the same custom Features as the dataset creator gets an error.

It would be nice to offer a fallback in this case.

Steps to reproduce the bug

load_dataset("alex-hh/custom-features-example")

Dataset creation script (must be run in a separate session, so that NewFeature is not registered in the session that attempts the download):

from dataclasses import dataclass, field

import pyarrow as pa

from datasets import Dataset, Feature, Features
from datasets.features.features import register_feature


@dataclass
class NewFeature(Feature):
    # Type name written to the dataset metadata
    _type: str = field(default="NewFeature", init=False, repr=False)

    def __call__(self):
        # Arrow storage type used for this feature
        return pa.int32()


# Register the custom feature under the same name as its _type
register_feature(NewFeature, "NewFeature")


def examples_generator():
    for i in range(5):
        yield {"feature": i}


ds = Dataset.from_generator(examples_generator, features=Features(feature=NewFeature()))
ds.push_to_hub("alex-hh/custom-features-example")

Expected behavior

It would be nice, and would make custom features more practical to share, if there were a graceful fallback for cases where user-defined features are stored in the dataset but are not registered locally, for example falling back to the default features inferred from the Arrow types.
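To make the request concrete, here is a minimal sketch of the kind of fallback I have in mind. It does not use the datasets internals: KNOWN_FEATURE_TYPES and resolve_feature are hypothetical names, and features are represented as plain dicts with a "_type" key purely for illustration. The idea is only that an unknown type name triggers a warning and a default derived from the Arrow storage type, rather than an error.

```python
import warnings

# Hypothetical stand-in for the library's registry of known feature type names.
KNOWN_FEATURE_TYPES = {"Value", "ClassLabel", "Sequence"}


def resolve_feature(serialized: dict, arrow_dtype: str) -> dict:
    """Return the serialized feature if its type is known locally.

    Otherwise warn and fall back to a plain Value of the Arrow storage dtype.
    """
    feature_type = serialized.get("_type")
    if feature_type in KNOWN_FEATURE_TYPES:
        return serialized
    warnings.warn(
        f"Feature type {feature_type!r} is not registered locally; "
        f"falling back to Value({arrow_dtype!r}).",
        UserWarning,
    )
    return {"_type": "Value", "dtype": arrow_dtype}


# The unregistered NewFeature from the example above is stored as int32 in Arrow.
print(resolve_feature({"_type": "NewFeature"}, "int32"))
```

With this behavior, someone loading the example dataset without NewFeature registered would get an int32 column and a warning, while known feature types pass through unchanged.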

Environment info

datasets version: 3.0.2