ENH: Rename `datasets` to `pipes`, that have built-in transforms

Question

NickleDave opened this issue a year ago · 2 comments

We will eventually have a datasets module, but what we have right now are really "pipes" in the sense of https://github.com/pytorch/data#what-are-datapipes. We should disambiguate "dataset" (a static set of files) from the pipelines we use to load files from a dataset.

Note that torchdata development is on hold. I think we should just move our internals under a new namespace for now.

Answer 1 · 2023-10-23T02:36:26.000Z

Related to the issue @wendtalexander ran into in #725:
I think that, as part of this renaming / refactoring, we can spare a user who is just running standard models with configs from having to understand the nuances of transforms, "dataset" classes, etc. We can do this by making each data pipe include a fixed transform, and then including the parameters for that transform in the parameters of the pipe itself.

For example, vak.datasets.frame_classification.WindowDataset and FramesDataset will become vak.datapipes.frame_classification.TrainDataPipe and EvalDataPipe, respectively. Both will have a window_size parameter. For the TrainDataPipe it will be used when loading windows from the dataset, whereas for the EvalDataPipe it will be used by the transform that turns entire spectrograms into batches of windows.
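
To make the two roles of window_size concrete, here is a minimal sketch. It is not vak's implementation: the class bodies, the `pad_value` parameter, the `batch` method, and the use of a plain list of frames in place of a spectrogram are all assumptions made for illustration.

```python
from typing import List, Sequence


class TrainDataPipe:
    """Hypothetical sketch: each index returns one window of frames."""

    def __init__(self, frames: Sequence[float], window_size: int):
        if window_size > len(frames):
            raise ValueError("window_size exceeds number of frames")
        self.frames = frames
        self.window_size = window_size

    def __len__(self) -> int:
        # Number of valid window start positions.
        return len(self.frames) - self.window_size + 1

    def __getitem__(self, idx: int) -> List[float]:
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # window_size is used at load time to slice out a window.
        return list(self.frames[idx : idx + self.window_size])


class EvalDataPipe:
    """Hypothetical sketch: a built-in transform turns the whole
    sequence into a batch of non-overlapping windows."""

    def __init__(self, frames: Sequence[float], window_size: int,
                 pad_value: float = 0.0):  # pad_value is an assumption
        self.frames = frames
        self.window_size = window_size
        self.pad_value = pad_value

    def batch(self) -> List[List[float]]:
        frames = list(self.frames)
        remainder = len(frames) % self.window_size
        if remainder:
            # Pad so the length is a multiple of window_size.
            frames += [self.pad_value] * (self.window_size - remainder)
        return [
            frames[start : start + self.window_size]
            for start in range(0, len(frames), self.window_size)
        ]


frames = [float(i) for i in range(10)]
train_pipe = TrainDataPipe(frames, window_size=4)
eval_pipe = EvalDataPipe(frames, window_size=4)
```

In this sketch the same parameter name drives both behaviors, so a user only ever sets window_size and never touches a separate transform.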

This would obviate the need for dataset_params, transform_params, train_dataset_params, and val_dataset_params; instead we'd just have datapipe_kwargs.

In a config this would look something like:

[vak.train.frame_classification.pipe]
window_size = 176

I prefer consistently naming them something like TrainDataPipe for all families, so that I don't spend a bunch of time thinking about the most appropriate name for each one. We can explain the details of how data is statically represented and dynamically transformed in the docstring.

Answer 2 · 2024-05-11T12:55:01.000Z

Closed by #755