Better Dataset type enforcement and error messages for `aggregate_over_dataset`, especially regarding the `new_fingerprint` kwarg
Is your feature request related to a problem? Please describe.
Anything that relies on `aggregate_over_dataset` assumes that the `dataset` argument is a Hugging Face `Dataset` object rather than a `DatasetDict` object. This is potentially problematic: if a user loads the dataset via `dataset = datasets.load_dataset(path_to_dataset)` rather than `dataset = datasets.Dataset.from_parquet(path_to_dataset)`, then `dataset` will be a `DatasetDict`. `DatasetDict` does have a `map` function, but it does not accept a `new_fingerprint` kwarg, so the call throws an opaque error.
Describe the solution you'd like
Add type hints to `aggregate_over_dataset`. Throw an error that isn't so opaque if someone tries to pass in a `DatasetDict` object rather than a `Dataset` object.
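
For illustration, here is a rough sketch of what such a check could look like. It is not the project's actual code: `require_single_dataset` is a hypothetical helper name, and the `Dataset` and `DatasetDict` classes below are stand-ins so the snippet runs without `datasets` installed. In the real code they would be replaced by `from datasets import Dataset, DatasetDict`.

```python
# Stand-in classes (assumption): replace with
# `from datasets import Dataset, DatasetDict` in the real code.
class Dataset:
    """Stand-in for datasets.Dataset."""


class DatasetDict(dict):
    """Stand-in for datasets.DatasetDict, a dict mapping split names to Datasets."""


def require_single_dataset(
    dataset: object, caller: str = "aggregate_over_dataset"
) -> Dataset:
    """Return `dataset` unchanged if it is a Dataset, else raise a readable TypeError.

    Hypothetical helper: the name and the message wording are illustrative only.
    """
    if isinstance(dataset, DatasetDict):
        # List the available splits so the user can see what to select.
        splits = ", ".join(repr(name) for name in dataset) or "<none>"
        raise TypeError(
            f"{caller} expects a Dataset but received a DatasetDict "
            f"with splits: {splits}. Select a single split first, "
            "e.g. dataset['train'], or pass split=... to load_dataset."
        )
    if not isinstance(dataset, Dataset):
        raise TypeError(
            f"{caller} expects a Dataset, got {type(dataset).__name__}."
        )
    return dataset
```

Calling this at the top of `aggregate_over_dataset`, together with a `dataset: Dataset` annotation on its signature, would surface the problem before `map` is ever called with `new_fingerprint`.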