`ConcatFilesDataset` needs a better name
NeoLegends opened this issue · 10 comments
The name is misleading given all the other concatenation-related datasets we have: people assume that the dataset will somehow modify the data itself, when its true purpose is to assist in large-scale trainings. This hinders explainability.
A name focusing on the intent might make more sense to many people, like:
- `LazyLoadingDataset`
- `ShardBySubepochDataset`
- maybe even `LargeScaleDataset`
Feel free to dump more ideas into this issue. Then we'll decide on a new name by June 11, 5pm CEST.
I like `LazyLoadingDataset` :)
Isn't it actually a wrapper of a dataset that allows for more efficient access to the internal dataset structure? Maybe then something like `LazyLoadingDatasetWrapper`?
> `LazyLoadingDatasetWrapper`
No, all datasets must end with `Dataset`. Also `MetaDataset`, `ConcatDataset`, `CombinedDataset`, `MultiProcDataset` etc. do, and they also all work the same way, i.e. wrap other datasets.
I prefer that the name describes how it operates rather than the intent. Currently this is the case for most of our datasets, functions, layers, everything. Thus sth like `LargeScaleDataset` is really too generic.

`ShardBySubepochDataset` is misleading, as there is no sharding. (At least not now. Maybe we might add this as another feature, via #1531.)

`LazyLoadingDataset` is also a bit too generic. Lazy loading happens in basically all datasets. It can mean anything. Sequences are always lazily loaded once needed (via `load_seqs`). `MultiProcDataset` also does a lot of lazy init stuff. Many other datasets also have lazy init to various degrees.
But what the dataset is doing (mainly) isn't concatenating other datasets. It is mainly about partitioning the files over the sub-epochs in a reasonable way to reduce the data load. Concatenation of the resulting files is a very unimportant detail of that process, yet it determines the name. Even the `HDFDataset` is a concatenation dataset in that regard, given that it can load more than one HDF at a time.
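As a rough sketch of that idea (a hypothetical helper, not RETURNN's actual implementation), the core operation is just distributing the file list over the sub-epochs, so that each sub-epoch only touches its own share of the files:

```python
# Hypothetical illustration only -- not RETURNN's actual implementation.
# Distribute a list of data files over N sub-epochs (round-robin here),
# so that each sub-epoch only needs to open and load its own subset.

def partition_files_over_subepochs(files, num_subepochs):
    """Round-robin split of `files` into `num_subepochs` groups."""
    groups = [[] for _ in range(num_subepochs)]
    for i, fn in enumerate(files):
        groups[i % num_subepochs].append(fn)
    return groups

parts = partition_files_over_subepochs(
    ["a.hdf", "b.hdf", "c.hdf", "d.hdf", "e.hdf"], num_subepochs=2
)
# Sub-epoch 1 would then only load parts[0], sub-epoch 2 only parts[1].
```

The concatenation only enters as an implementation detail afterwards, when the files assigned to one sub-epoch are presented as a single stream of sequences.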
The point is the existing name is way too close to other datasets that have a meaningful concatenation operation for people to be able to clearly differentiate between them upon first glance.
New suggestion: `PartitionDataBySubepochDataset`
Some suggestions:
- (For reference, `RandomShuffleConcatDataset` was the initial suggestion.) `ConcatFilesDataset`, i.e. keep the current name. Note that the original motivation for this name was the strong relation to `ConcatDataset`, which basically does the same thing conceptually.
- `FileDistribDataset`. "Distribution" can refer to the distribution of the files over the sub-epochs. But at the same time, it can also refer to distributed training.
- `FileListDataset`
- `FilesMetaDataset`
I am fine with `FileDistribDataset`.
> The point is the existing name is way too close to other datasets that have a meaningful concatenation operation for people to be able to clearly differentiate between them upon first glance.
Which exactly do you mean? There are indeed a couple of datasets which can be confusing. However, they all do different things. E.g.:
- `MetaDataset`: combines multiple datasets, num seqs stays the same as for a single dataset, just adds the data keys
- `ConcatDataset`: combines multiple datasets, num seqs is the sum over all datasets, data keys must be the same over all datasets
- `CombinedDataset`: combines multiple datasets, num seqs is the sum over all datasets, data keys can be different over the datasets
- `ConcatSeqsDataset`: operates on a single dataset, concatenates sequences, i.e. num seqs is less than in the original dataset
Conceptually, `ConcatFilesDataset` is exactly the same as `ConcatDataset`. The others are all different.
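The distinctions above can be illustrated with a toy sketch, using plain Python dicts of lists in place of actual RETURNN datasets (all names and values here are made up for illustration):

```python
# Toy illustration only; plain dicts of lists stand in for RETURNN datasets.
ds_a = {"data": [10, 11, 12]}   # 3 seqs, data key "data"
ds_b = {"data": [20, 21]}       # 2 seqs, same data key
ds_c = {"classes": [0, 1, 0]}   # 3 seqs, different data key

# MetaDataset-style: num seqs stays at 3, the data keys are merged
# (sequences are paired up 1:1 across the wrapped datasets).
meta = {**ds_a, **ds_c}

# ConcatDataset-style: num seqs is the sum (3 + 2 = 5), and the data
# keys must match across the wrapped datasets.
concat = {"data": ds_a["data"] + ds_b["data"]}
```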
I also like `FileDistribDataset`, or, even more complete for clarity, `FileDistributedDataset`.