som-shahlab/femr

Better Dataset type enforcement and error messages for `aggregate_over_dataset`, especially regarding the `new_fingerprint` kwarg

Is your feature request related to a problem? Please describe.
Anything that relies on `aggregate_over_dataset` assumes that the `dataset` argument is a Hugging Face `Dataset` object rather than a `DatasetDict` object. This is a problem when a user loads the data via `dataset = datasets.load_dataset(path_to_dataset)` instead of `dataset = datasets.Dataset.from_parquet(path_to_dataset)`, because `dataset` is then a `DatasetDict`. `DatasetDict` does have a `map` method, but that method does not accept a `new_fingerprint` kwarg, so the call fails with an opaque error.

Describe the solution you'd like
Add type hints to `aggregate_over_dataset`. Raise a clearer error if someone passes in a `DatasetDict` object rather than a `Dataset` object.
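
A rough sketch of what the check could look like. `ensure_single_dataset` and its `caller` argument are hypothetical names, not existing femr code. It relies on the assumption that `DatasetDict` subclasses `dict` (so a `Mapping` check catches it) while a single `Dataset` does not, which lets the guard run without importing `datasets`:

```python
from collections.abc import Mapping
from typing import Any


def ensure_single_dataset(dataset: Any, caller: str = "aggregate_over_dataset") -> Any:
    """Hypothetical guard: reject DatasetDict-like inputs with a readable error.

    Assumption: datasets.DatasetDict is a dict subclass keyed by split name,
    whereas datasets.Dataset is not a Mapping.
    """
    if isinstance(dataset, Mapping):
        # List the split names so the user can see what to select.
        splits = ", ".join(repr(name) for name in dataset.keys())
        raise TypeError(
            f"{caller} expects a single datasets.Dataset, but received "
            f"{type(dataset).__name__} with splits [{splits}]. "
            "Select one split first (e.g. dataset['train']) or load the data "
            "with datasets.Dataset.from_parquet(...)."
        )
    # Anything that is not a mapping is passed through unchanged.
    return dataset
```

Calling something like this at the top of `aggregate_over_dataset`, before `map` is invoked with `new_fingerprint`, would surface the mistake with a message that names the splits instead of an unexpected-keyword failure.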