Deprecate `validate_dataset` and `regenerate_dataset`
These APIs are pretty complex (relying on a lot of advanced dask and pyarrow code) and are therefore difficult to maintain. The `validate_dataset` utility also relies on deprecated pyarrow behavior that is no longer supported in a rapids-24.04 environment. `regenerate_dataset` is marked as experimental, but `validate_dataset` is not.
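For reference, one common way to stage a deprecation like this is to emit a `FutureWarning` on each call before removing the function in a later release. The sketch below is only an illustration: the `deprecated` decorator is a hypothetical helper, and the `validate_dataset` body is a placeholder, not the actual merlin implementation or signature.

```python
import functools
import warnings


def deprecated(reason):
    """Hypothetical helper: wrap a function so every call emits a FutureWarning."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated and will be removed in a "
                f"future release. {reason}",
                FutureWarning,
                stacklevel=2,  # point the warning at the caller, not the wrapper
            )
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Placeholder standing in for the real utility; the body is NOT merlin code.
@deprecated("It relies on pyarrow behavior that is no longer supported.")
def validate_dataset(path):
    return True
```

Because `validate_dataset` is not marked experimental, a warning period like this would give users a release or two of notice before removal.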
@rnyak - How important is it to preserve these utilities?
Primary motivation: #371 (comment)
Further perspective: I strongly prefer that we try to deprecate/remove as much IO code as possible. Dask DataFrame IO is much more stable than it was at the time the `merlin.io` module was developed. Once merlin adopts query-planning, the Dask DataFrame API will actually be much better. I don't think the Dataset API should be thrown away. However, I do think it needs to be slimmed down as much as possible.