rwth-i6/returnn

`ConcatFilesDataset` combines poorly with `MetaDataset`

NeoLegends opened this issue · 6 comments

I'm working on running a test config using ConcatFilesDataset (#1521, #1519).

The original setup I have uses MetaDataset to load the features and the alignment targets from two distinct sets of HDFs. This makes integrating ConcatFilesDataset pretty difficult, because it cannot deal with heterogeneous data. Take the following example:

train = {
    "class": "MetaDataset",
    "data_map": {"classes": ("alignments", "data"), "data": ("features", "data")},
    "datasets": {
        "alignments": {
            "class": "HDFDataset",
            "files": [
                "/alignment/files.hdf",
                "/..."
            ],
            "partition_epoch": 250,
            "seq_ordering": "random",
        },
        "features": {
            "class": "HDFDataset",
            "files": [
                "/feature/files.hdf",
                "/..."
            ],
        },
    },
    "seq_order_control_dataset": "alignments",
}

Where do we integrate the ConcatFilesDataset?

Thoughts:

  1. We can place ConcatFilesDataset around the MetaDataset. This is problematic because a) we cannot give both the alignment caches and the feature caches to the ConcatFilesDataset's files list, as they are fundamentally different in size and heterogeneous in content. We could b) give the ConcatFilesDataset only the features and always include all alignment caches in the MetaDataset emitted by get_sub_epoch_dataset. This would work, but it would be slow, as we'd have to reload all alignment caches for every subepoch. We could also c) try to compute the relevant alignment caches from a given list of feature caches in get_sub_epoch_dataset, but I'm not sure we can assume this always works. It might be worth precomputing this info, though.
  2. We can place ConcatFilesDataset inside the MetaDataset, where the HDFDatasets currently are (a rough config sketch follows this list). This works as long as only one of the sub-datasets is a ConcatFilesDataset, because that one needs to be the seq_order_control_dataset (it does not accept a seq_order from outside, 02fa44b#diff-6f16d6edae7b1113b8c292acded8b6ab78875f54dca08cd58edd11110709372fR189-R190). This means we can only use ConcatFilesDataset for one of the data streams, but the advantage over option 1b) is that we don't need to reload the HDFs for the other data stream after every subepoch. This might be enough if only one stream is data-heavy (features) and the other ones are small enough that they can be loaded the normal way (targets). If we want both datasets to be ConcatFilesDatasets, to leverage the advanced prefetching and caching behavior, we can precompute a mapping between sequence ID <-> containing file path. Then one of the datasets can be the seq_order_control_dataset, and the others use that mapping to load the data files and set up the sub dataset for precisely the relevant segments on demand.
  3. There might be issues with partition_epoch being specified on both the parent MetaDataset and on the ConcatFilesDataset, but I haven't put much thought into that yet.
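For illustration, a minimal sketch of what thought 2 could look like, assuming ConcatFilesDataset takes a files list plus a get_sub_epoch_dataset callable that receives the sub-epoch subset of files and returns a dataset dict (parameter names are taken from this discussion, so treat the details as assumptions):

# Rough sketch of thought 2: ConcatFilesDataset as one sub-dataset of the
# MetaDataset, acting as seq_order_control_dataset, while the alignments stay
# in a plain HDFDataset that is kept loaded across subepochs.
train = {
    "class": "MetaDataset",
    "data_map": {"classes": ("alignments", "data"), "data": ("features", "data")},
    "datasets": {
        "alignments": {
            "class": "HDFDataset",
            "files": ["/alignment/files.hdf", "/..."],
        },
        "features": {
            "class": "ConcatFilesDataset",
            "files": ["/feature/files.hdf", "/..."],
            # Called per subepoch with the subset of feature files for that subepoch.
            "get_sub_epoch_dataset": lambda feature_files: {
                "class": "HDFDataset",
                "files": feature_files,
            },
            # partition_epoch / seq ordering handling omitted here (cf. thought 3).
        },
    },
    # The ConcatFilesDataset has to be the control dataset,
    # since it does not accept a seq_order from outside.
    "seq_order_control_dataset": "features",
}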

WDYT? Is there anything I'm missing?

ConcatFilesDataset is intended to be around everything else if you have a hierarchy of multiple datasets (like here with MetaDataset). That is the only possible way, because otherwise it would have to be able to operate on the whole dataset at once, which is not really possible with ConcatFilesDataset; avoiding exactly that is its point. E.g. init_seq_order would otherwise be called with a seq_list or seq_order, which would fail.

This is problematic because a) we cannot give both the alignment caches and the feature caches to the ConcatFilesDataset's files list, as they are fundamentally different in size and heterogeneous in content.

Oh yeah, I did not think about this before. But fortunately this is simple to extend: ConcatFilesDataset's files could be a list of arbitrarily nested structures (e.g. tuples or dicts), where all the leaves should be files (i.e. just str). So in this specific case, you could give it list[tuple[str,str]], or maybe use list[dict[str,str]], where all the dicts would have "alignments": file and "features": file entries. Then get_sub_epoch_dataset would get a subset of that list. Calculating the size would just iterate over all leaves. This should all be fairly straightforward with the tree package (which we use in other places, e.g. also in FileCache; see import tree).
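For concreteness, a sketch of what this could look like from the config side. This is hypothetical, since the nested-files support is the proposed extension and not implemented at this point; the parameter names follow the ones referenced in this thread:

def get_sub_epoch_dataset(files_subset):
    # files_subset: the sub-epoch subset of the "files" list below,
    # i.e. list[dict[str, str]] with keys "features" and "alignments".
    return {
        "class": "MetaDataset",
        "data_map": {"classes": ("alignments", "data"), "data": ("features", "data")},
        "datasets": {
            "alignments": {
                "class": "HDFDataset",
                "files": [entry["alignments"] for entry in files_subset],
            },
            "features": {
                "class": "HDFDataset",
                "files": [entry["features"] for entry in files_subset],
            },
        },
        "seq_order_control_dataset": "alignments",
    }

train = {
    "class": "ConcatFilesDataset",
    "files": [
        {"features": "/feature/files.hdf", "alignments": "/alignment/files.hdf"},
        # ...
    ],
    "get_sub_epoch_dataset": get_sub_epoch_dataset,
    # partition_epoch, seq ordering etc. omitted here.
}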

Do you want to make a PR for that?

Or, another possibility which should already work right now: in ConcatFilesDataset's files, you only specify one of them, either features or alignments. I assume you can infer the filename of the other one, i.e. given a feature filename, you can infer the corresponding alignment filename. Or otherwise you can also have a dict in your config, something like alignment_file_per_feature_file, which maps one to the other. Then in get_sub_epoch_dataset, you will get one of them, but you can infer the other.
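A sketch of this variant; it differs from the previous sketch only in that files contains just the feature HDFs and the alignment HDFs are looked up inside get_sub_epoch_dataset. The mapping dict alignment_file_per_feature_file and the file paths are illustrative and would come from your own setup:

# Prepared in advance, in whatever way fits the setup.
alignment_file_per_feature_file = {
    "/feature/files.hdf": "/alignment/files.hdf",
    # ...
}

def get_sub_epoch_dataset(feature_files):
    # feature_files: the sub-epoch subset of the feature HDFs given in "files" below.
    return {
        "class": "MetaDataset",
        "data_map": {"classes": ("alignments", "data"), "data": ("features", "data")},
        "datasets": {
            "alignments": {
                "class": "HDFDataset",
                "files": [alignment_file_per_feature_file[fn] for fn in feature_files],
            },
            "features": {"class": "HDFDataset", "files": list(feature_files)},
        },
        "seq_order_control_dataset": "alignments",
    }

train = {
    "class": "ConcatFilesDataset",
    "files": sorted(alignment_file_per_feature_file.keys()),  # only the feature files
    "get_sub_epoch_dataset": get_sub_epoch_dataset,
}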

But I think the extension as discussed in my previous post makes sense anyway. It's probably cleaner.

I like the tree idea. I'm going to think about that for a bit and then open a PR. I think it still requires consistently prepared HDFs or a mapping that's known beforehand.

Or, another possibility which should already work right now: in ConcatFilesDataset's files, you only specify one of them, either features or alignments. I assume you can infer the filename of the other one, i.e. given a feature filename, you can infer the corresponding alignment filename. Or otherwise you can also have a dict in your config, something like alignment_file_per_feature_file, which maps one to the other. Then in get_sub_epoch_dataset, you will get one of them, but you can infer the other.

Yes, as I wrote in my large comment: this would work if both types of caches are prepared in a consistent way or if the mapping is known from somewhere in advance. But it's very cumbersome and probably also brittle from setup to setup. I think in that case I'd rather re-dump all the data into a new, consistent set of HDFs and drop the MetaDataset altogether.

this would work if both types of caches are prepared in a consistent way or if the mapping is known from somewhere in advance

But the approach with tree also assumes that there is such a mapping, and that you know it in advance, doesn't it?

This would work, but it would be slow, as we'd have to reload all alignment caches for every subepoch.

I don't understand this. Why do you need to load them all?

  • Because you actually don't have individual alignment files but only one single alignment file? But the tree approach would have the same problem then. I think the best solution is that you split up the HDFs consistently with the features.
  • Or if you already have individual alignments, I don't understand your comment. Why do you need to load them all?

But the approach with tree also assumes that there is such a mapping, and that you know it in advance, doesn't it?

Yes, I edited my comment; that was not in there before.

I don't understand this. Why do you need to load them all?

I think I was unclear here: I did not mean loading the data, but opening the HDFs, reading out which sequences are stored in which HDFs, etc. All the prep work.

If you only had a single alignment HDF, this would apply to the tree approach as well, yes. I think in that case it might really be helpful to have the concat dataset not operate as the primary dataset, but as a sub-dataset of the MetaDataset, because this avoids redoing the initial prep work for the alignments.

I think it all depends on how large the alignment dataset is. It's probably not very large. Anyway, we should support the tree use case. I believe we often have consistently prepared data. I'm PRing this.

I think I was unclear here: I did not mean loading the data, but opening the HDFs, reading out which sequences are stored in which HDFs, etc. All the prep work.

I also don't understand this. Why? You mean under the assumption that you don't know the mapping in advance? But if that is the case, i.e. you don't know the mapping, or there is no clear mapping, then the tree approach would not work either.

Or if you have the mapping, you don't need to open all the HDFs but just the right ones. But then the tree approach, or doing the mapping logic inside get_sub_epoch_dataset, would basically lead to the same solution, just implemented slightly differently.

I think the only reasonable case is that you have a clear mapping, and you know it in advance. If this is not the case yet, it should be easy to prepare it that way.
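For example, if the HDFs are dumped consistently, such a mapping could be derived from a simple filename convention (purely hypothetical paths, just to illustrate):

# Hypothetical example of deriving the mapping from a naming convention.
feature_files = ["/data/features/part-%03d.hdf" % i for i in range(100)]
alignment_file_per_feature_file = {
    fn: fn.replace("/features/", "/alignments/") for fn in feature_files
}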