rwth-i6/returnn

`ConcatFilesDataset` needs a better name

NeoLegends opened this issue · 10 comments

#1521, #1519.

The name is misleading given all the other concatenation-related datasets we have: people assume the dataset will somehow modify the data itself, when its true purpose is to assist in large-scale training. This makes the dataset hard to explain.

A name focusing on the intent might make more sense to many people, like:

  • LazyLoadingDataset
  • ShardBySubepochDataset
  • maybe even LargeScaleDataset

Feel free to dump more ideas into this issue. Then we'll decide on a new name by June 11, 5pm CEST.

I like LazyLoadingDataset :)

Isn't it actually a wrapper around a dataset that allows for more efficient access to the internal dataset structure? Maybe then something like LazyLoadingDatasetWrapper?

LazyLoadingDatasetWrapper

No, all datasets must end with Dataset. MetaDataset, ConcatDataset, CombinedDataset, MultiProcDataset etc. also do, and they all work the same way, i.e. they also wrap other datasets.

I prefer that the name describes how it operates rather than the intent. Currently this is the case for most of our datasets, functions, layers, everything. Thus something like LargeScaleDataset is really too generic.

ShardBySubepochDataset is misleading, as there is no sharding. (At least not now; we might add that later as another feature, via #1531.)

LazyLoadingDataset is also a bit too generic. Lazy loading happens in basically all datasets, so it can mean anything. Sequences are always lazily loaded once they are needed (via load_seqs). MultiProcDataset also does a lot of lazy init, and many other datasets have lazy init to various degrees.
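To illustrate why "lazy loading" describes essentially every dataset, here is a simplified sketch of the generic pattern (not the exact RETURNN internals; only the load_seqs name is taken from the real base class, the rest is illustrative):

```python
class AnyDatasetSketch:
    """Simplified sketch of the lazy-loading pattern common to all datasets."""

    def __init__(self, num_seqs):
        self.num_seqs = num_seqs
        self._loaded = {}  # seq_idx -> seq data, filled only on demand

    def load_seqs(self, start, end):
        """Load seqs in range [start, end). Called only once the seqs are needed."""
        for seq_idx in range(start, end):
            if seq_idx not in self._loaded:
                self._loaded[seq_idx] = self._read_seq(seq_idx)

    def _read_seq(self, seq_idx):
        # Placeholder for the actual storage access (HDF file, etc.).
        return {"data": [seq_idx]}
```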

But what the dataset mainly does isn't concatenating other datasets. It is mainly about partitioning the files over the subepochs in a reasonable way to reduce the data load. Concatenating the resulting files per subepoch is a rather unimportant detail of that process, yet it determines the name. Even the HDFDataset is a concatenation dataset in that regard, given that it can load more than one HDF at a time.
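A minimal sketch of that partitioning idea (the function and parameter names here are hypothetical, not the actual ConcatFilesDataset API; the real dataset also does things like shuffling the files across full epochs):

```python
from typing import List, TypeVar

T = TypeVar("T")

def partition_files_over_subepochs(files: List[T], partition_epoch: int) -> List[List[T]]:
    """Hypothetical helper: distribute the files over `partition_epoch`
    subepochs, so each subepoch only has to load its own small subset."""
    assert partition_epoch >= 1
    buckets: List[List[T]] = [[] for _ in range(partition_epoch)]
    for i, f in enumerate(files):
        buckets[i % partition_epoch].append(f)  # simple round-robin assignment
    return buckets

# 10 files over 4 subepochs -> each subepoch touches only 2-3 files.
print(partition_files_over_subepochs([f"data-{i}.hdf" for i in range(10)], 4))
```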

The point is that the existing name is way too close to the names of other datasets that perform a meaningful concatenation operation, so people cannot clearly tell them apart at first glance.

New suggestion: PartitionDataBySubepochDataset

Some suggestions:

  • (For reference, RandomShuffleConcatDataset was the initial suggestion.)
  • ConcatFilesDataset, i.e. keep current name. Note that the original motivation for this name was the strong relation to ConcatDataset, which basically does the same thing conceptually.
  • FileDistribDataset. "Distribution" can refer to the distribution of the files over the subepochs, but at the same time it can also refer to distributed training.
  • FileListDataset
  • FilesMetaDataset

I am fine with FileDistribDataset

The point is that the existing name is way too close to the names of other datasets that perform a meaningful concatenation operation, so people cannot clearly tell them apart at first glance.

Which ones exactly do you mean? There are indeed a couple of datasets which can be confusing. However, they all do different things. E.g.:

  • MetaDataset: combines multiple datasets; the num seqs stays the same as for a single one, it just adds the data keys
  • ConcatDataset: combines multiple datasets; the num seqs is the sum over all datasets; the data keys must be the same across all datasets
  • CombinedDataset: combines multiple datasets; the num seqs is the sum over all datasets; the data keys can differ between the datasets
  • ConcatSeqsDataset: operates on a single dataset and concatenates sequences, i.e. the num seqs is less than in the original dataset

Conceptually, ConcatFilesDataset is exactly the same as ConcatDataset. The others are all different.
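For concreteness, a hedged config sketch of the three combining datasets (the "class" values are the real dataset names; treat the exact option layouts as illustrative and check the docstrings):

```python
# Two placeholder sub-datasets (HDFDataset and its "files" option do exist).
sub_a = {"class": "HDFDataset", "files": ["a.hdf"]}
sub_b = {"class": "HDFDataset", "files": ["b.hdf"]}

# MetaDataset: num seqs stays the same as for a single sub-dataset;
# it just merges the data keys of all sub-datasets per sequence.
meta = {"class": "MetaDataset", "datasets": {"a": sub_a, "b": sub_b}}

# ConcatDataset: num seqs is the sum over all sub-datasets;
# the data keys must be identical across them.
concat = {"class": "ConcatDataset", "datasets": [sub_a, sub_b]}

# CombinedDataset: num seqs is also the sum over all sub-datasets,
# but the data keys may differ between them.
combined = {"class": "CombinedDataset", "datasets": [sub_a, sub_b]}
```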

I also like FileDistribDataset, or, even more complete for clarity, FileDistributedDataset.