ENH: Rename `datasets` to `pipes` that have built-in transforms
NickleDave opened this issue · 2 comments
We will eventually have a datasets module but what we really have right now are "pipes" in the sense of https://github.com/pytorch/data#what-are-datapipes. We should disambiguate the meaning of "dataset" (static set of files) from the pipelines we use to load files from a dataset.
Note that `torchdata` development is on hold. I think we should just move our internals under a new namespace for now.
Related to the issue @wendtalexander ran into in #725:
I think as part of this renaming / refactoring, we can spare a user who is just running standard models with configs from having to understand the nuances of transforms, "dataset" classes, etc. We would do this by making each data pipe include a fixed transform, and then including the parameters for that transform in the parameters of the pipe itself.
For example: `vak.datasets.frame_classification.WindowDataset` and `FramesDataset` will become `vak.datapipes.frame_classification.TrainDataPipe` and `EvalDataPipe`, respectively. Both will have a `window_size` parameter that, for the `TrainDataPipe`, will get used when loading windows from the dataset, whereas for the `EvalDataPipe` it will be used for the transform that makes entire spectrograms into batches of windows.
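To make the idea concrete, here is a rough sketch of two pipes that share a `window_size` parameter but use it differently. This is not vak's implementation: the constructor signatures, the `padval` and `seed` arguments, and the use of a plain list of per-frame values as a stand-in for a spectrogram are all assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence


class TrainDataPipe:
    """Hypothetical sketch: return one random window of frames per recording."""

    def __init__(self, recordings: Sequence[Sequence[float]], window_size: int,
                 seed: Optional[int] = None) -> None:
        if any(len(rec) < window_size for rec in recordings):
            raise ValueError("every recording needs at least window_size frames")
        self.recordings = recordings
        self.window_size = window_size
        self._rng = random.Random(seed)

    def __len__(self) -> int:
        return len(self.recordings)

    def __getitem__(self, idx: int) -> List[float]:
        frames = self.recordings[idx]
        # Here window_size controls how many frames are loaded from the dataset.
        start = self._rng.randint(0, len(frames) - self.window_size)
        return list(frames[start:start + self.window_size])


class EvalDataPipe:
    """Hypothetical sketch: turn one whole recording into a batch of windows."""

    def __init__(self, recordings: Sequence[Sequence[float]], window_size: int,
                 padval: float = 0.0) -> None:
        self.recordings = recordings
        self.window_size = window_size
        self.padval = padval  # assumed padding value, not taken from vak

    def __len__(self) -> int:
        return len(self.recordings)

    def __getitem__(self, idx: int) -> List[List[float]]:
        frames = list(self.recordings[idx])
        # Here window_size parameterizes the built-in transform:
        # pad to a multiple of window_size, then split into windows.
        n_pad = -len(frames) % self.window_size
        frames.extend([self.padval] * n_pad)
        return [frames[i:i + self.window_size]
                for i in range(0, len(frames), self.window_size)]


recordings = [[float(i) for i in range(10)]]
train_pipe = TrainDataPipe(recordings, window_size=4, seed=0)
eval_pipe = EvalDataPipe(recordings, window_size=4)
print(len(train_pipe[0]))  # 4 frames in one training window
print(len(eval_pipe[0]))   # 3 windows: 10 frames padded to 12
```

The point of the sketch is that the user sets one parameter and never touches a separate transform object; each pipe decides internally what `window_size` means for it.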
This would obviate the need for `dataset_params`, `transform_params`, `train_dataset_params`, and `val_dataset_params`; instead we'd just have `datapipe_kwargs`.
In a config this would look something like:
[vak.train.frame_classification.pipe]
window_size = 176
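As a sketch of how such a table could be turned into `datapipe_kwargs`, the snippet below parses the config with the standard library's TOML parser and pulls out the `pipe` table. The helper name `get_datapipe_kwargs` and the lookup by command and family are assumptions for illustration, not an existing vak function.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # backport with the same API for older Pythons

CONFIG_TOML = """
[vak.train.frame_classification.pipe]
window_size = 176
"""


def get_datapipe_kwargs(config: dict, command: str, family: str) -> dict:
    """Hypothetical helper: return the ``pipe`` table, or an empty dict if absent."""
    table = config.get("vak", {}).get(command, {}).get(family, {})
    return dict(table.get("pipe", {}))


# The dotted table header is parsed into nested dicts.
config = tomllib.loads(CONFIG_TOML)
datapipe_kwargs = get_datapipe_kwargs(config, "train", "frame_classification")
print(datapipe_kwargs)  # {'window_size': 176}
```

The resulting dict could then be unpacked directly into the pipe's constructor, e.g. `TrainDataPipe(..., **datapipe_kwargs)`.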
I prefer just consistently naming them something like `TrainDataPipe` for all families, so that I don't spend a bunch of time thinking about what the most appropriate name for them is. We can explain details of how data is statically represented + dynamically transformed in the docstring.
Closed by #755