Defining a large dataset with several splits under it

Question

Opened this issue a year ago · 1 comment

We have a recurring pattern in which a single dataset has multiple splits that differ by language, subtask, train/dev/test partition, etc., but share the same file structure.
Our current implementation either assumes one dataset class per split (the metadata specifies a single dataset language, for example), or reuses the same dataset class in an ad hoc way, passing different split file names with different assets.

I think we should find a more unified way to handle such cases, e.g., a parent dataset with subsets under it, where the subsets differ only by metadata.

Answer 1 · 2023-09-04T07:40:18.000Z

This is a good suggestion. We can do several things here:

1. Have a single ParentDataset class that takes a dataset argument which selects a particular language/split; the metadata in this case would highlight the multilingual nature of the dataset.
2. Have a single ParentDataset class as above, plus several child classes that inherit from it, each with only a single line of implementation that calls the parent class constructor with a particular language/task set. An example would be a parent CT22Dataset class and a child CT22CheckworthinessDataset class. The child classes would just be convenience/syntactic sugar, but might be useful imo.
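
A rough sketch of how the two options could fit together, in case it helps the discussion. Only the class names CT22Dataset and CT22CheckworthinessDataset come from the comment above; the SubsetInfo record, the SUBSETS registry, the subset keys, the method names, and the file layout are all hypothetical and not taken from the existing codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubsetInfo:
    """Metadata that differs between subsets (all fields are hypothetical)."""
    language: str
    task: str
    splits: tuple


class CT22Dataset:
    """Option 1: one parent class; the `subset` argument selects a language/task."""

    # Hypothetical registry: subsets share a file structure and differ only by metadata.
    SUBSETS = {
        "checkworthiness_ar": SubsetInfo("ar", "checkworthiness", ("train", "dev", "test")),
        "checkworthiness_en": SubsetInfo("en", "checkworthiness", ("train", "dev", "test")),
        "harmful_ar": SubsetInfo("ar", "harmful", ("train", "dev", "test")),
    }

    def __init__(self, subset):
        if subset not in self.SUBSETS:
            raise ValueError(f"Unknown subset {subset!r}; choose from {sorted(self.SUBSETS)}")
        self.subset = subset
        self.info = self.SUBSETS[subset]

    @classmethod
    def metadata(cls):
        """Parent-level metadata listing every language and task covered."""
        return {
            "languages": sorted({info.language for info in cls.SUBSETS.values()}),
            "tasks": sorted({info.task for info in cls.SUBSETS.values()}),
        }

    def data_path(self, split):
        """Build a file path from the shared layout (the layout itself is an assumption)."""
        if split not in self.info.splits:
            raise ValueError(f"Unknown split {split!r} for subset {self.subset!r}")
        return f"CT22/{self.info.task}/{self.info.language}/{split}.tsv"


class CT22CheckworthinessDataset(CT22Dataset):
    """Option 2: convenience child class that pins the task."""

    def __init__(self, language):
        # Single line of implementation: delegate to the parent with a fixed task.
        super().__init__(f"checkworthiness_{language}")
```

With this layout, CT22Dataset("harmful_ar") and CT22CheckworthinessDataset("en") both go through the same loading logic, and adding a new language or subtask only means adding one entry to the registry.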