NVIDIA-Merlin/core

Deprecate `validate_dataset` and `regenerate_dataset`


These APIs are fairly complex (they rely on a lot of advanced dask and pyarrow code) and are therefore difficult to maintain. The `validate_dataset` utility also relies on deprecated pyarrow behavior that is no longer supported in a rapids-24.04 environment.

`regenerate_dataset` is marked as experimental, but `validate_dataset` is not.
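
For reference, here is a minimal sketch of what a deprecation path could look like for a utility that is not already marked experimental. It assumes a plain `FutureWarning` emitted on each call; the `deprecated` decorator and `validate_dataset_stub` are hypothetical names for illustration and do not reflect the actual merlin-core code or signatures.

```python
import functools
import warnings


def deprecated(reason):
    """Hypothetical helper: make every call to the wrapped function emit a FutureWarning."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated and will be removed "
                f"in a future release. {reason}",
                FutureWarning,
                stacklevel=2,  # attribute the warning to the caller's line
            )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@deprecated("It is difficult to maintain and relies on deprecated pyarrow behavior.")
def validate_dataset_stub(path):
    # Stand-in for the real utility; the body and signature are placeholders.
    return True
```

The existing behavior is left untouched, so users get at least one release of notice before the code is removed.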

@rnyak - How important is it to preserve these utilities?

Primary motivation: #371 (comment)

Further perspective: I strongly prefer that we deprecate or remove as much IO code as possible. Dask DataFrame IO is much more stable than it was when the `merlin.io` module was developed, and once Merlin adopts query planning, the Dask DataFrame API will be much better still. I don't think the `Dataset` API should be thrown away, but I do think it needs to be slimmed down as much as possible.