iSEE/iSEEindex

Reusing configurations across datasets

tomsing1 opened this issue · 2 comments

When we have a large number of datasets of similar type (e.g. bulk RNA-seq experiments, each with associated differential gene expression results), it might be useful to apply the same configuration to each of them.

For example, we might have three generic configurations, each corresponding to a different iSEE setup:

  1. Panels to explore normalized gene expression (logcounts)
  2. Panels to explore a single contrast from differential expression analysis (DEA)
  3. Panels to compare two contrasts to each other

Right now, it seems that every configuration shown to the user must be defined per dataset, and each config_id must be unique. If I understand correctly, that means I have to provide the 3 configs for each dataset, just with a different config_id? If that's true, then it leads to a lot of duplication in the config files.

Instead of defining unique config_id and dataset_id pairings in the YAML file, perhaps we could define configs without reference to a dataset. And then a dataset could receive a list of one or more config_ids that are associated with it - regardless of whether other datasets use the same ones or not?
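To make the proposal concrete, here is a minimal sketch of the data model in Python. All identifiers (`CONFIGS`, `DATASETS`, `configs_for`, the dataset and config names) are hypothetical and are not the iSEEindex schema; the point is only that each configuration is defined once and datasets refer to it by `config_id`.

```python
# Hypothetical names throughout; this is not the iSEEindex YAML schema.

# Each configuration is defined once, without reference to any dataset.
CONFIGS = {
    "logcounts": "Panels to explore normalized gene expression",
    "single_contrast": "Panels to explore a single DEA contrast",
    "compare_contrasts": "Panels to compare two contrasts",
}

# Each dataset lists the config_ids it supports; sharing is allowed.
DATASETS = {
    "study_a": ["logcounts", "single_contrast", "compare_contrasts"],
    "study_b": ["logcounts", "single_contrast"],
}


def configs_for(dataset_id):
    """Return (config_id, description) pairs available for one dataset."""
    return [(cid, CONFIGS[cid]) for cid in DATASETS[dataset_id]]
```

With this layout, adding a new dataset means adding one entry that lists existing config_ids, rather than copying three configuration blocks.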

Finally getting to think about this one. I agree with the idea of reducing redundancy by defining configurations separately from data sets, and separately listing configurations available for each data set.

I can see two options.

  1. Define three items: 1) table of data sets, 2) table of configurations, 3) table mapping data sets and configurations.
  2. Define two items: 1) table of configurations, 2) table of data sets including a column listing configurations associated with each data set.
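For comparison, option 1 could look roughly like the sketch below: three separate tables, with the mapping table joined against the configurations to populate the dropdown. The table names, column names, and the `dropdown_choices` helper are all made up for illustration and do not reflect how iSEEindex stores anything.

```python
# Hypothetical tables; column names are assumptions for illustration only.
DATASET_TABLE = [
    {"dataset_id": "study_a", "title": "Bulk RNA-seq, study A"},
    {"dataset_id": "study_b", "title": "Bulk RNA-seq, study B"},
]

CONFIG_TABLE = [
    {"config_id": "logcounts", "title": "Normalized expression"},
    {"config_id": "single_contrast", "title": "Single contrast"},
]

# One row per (dataset, configuration) pairing; never shown to users.
MAPPING_TABLE = [
    ("study_a", "logcounts"),
    ("study_a", "single_contrast"),
    ("study_b", "logcounts"),
]


def dropdown_choices(dataset_id):
    """Return configuration titles to offer once a dataset is selected."""
    titles = {row["config_id"]: row["title"] for row in CONFIG_TABLE}
    return [titles[cid] for did, cid in MAPPING_TABLE if did == dataset_id]
```

The dataset table stays free of private columns here, at the cost of a third file to maintain.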

Option 1 is very database-like and possibly overkill. Still, I'm tempted by the idea of keeping the mapping separate from the table of data sets, as that table is displayed to the users (at least part of it) while the mapping isn't (it is only used to populate the dropdown of configurations when users choose a data set).

That said, the side note "(at least part of it)" above indicates that there are already some "private" columns of metadata in the table of data sets, which means that option 2 is not that far-fetched. It could be a column of comma-separated identifiers for the configurations associated with the data set.
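As a rough sketch of what reading such a column might involve, the Python snippet below splits one cell into identifiers and checks them against the known configurations. The function name `parse_config_ids` and the choice to fail on unknown identifiers are my own assumptions, not existing iSEEindex behaviour.

```python
def parse_config_ids(cell, known_ids):
    """Split a comma-separated cell into config_ids and validate them.

    `cell` is the raw string from the (hypothetical) private column;
    `known_ids` is the set of config_ids defined in the configuration table.
    """
    # Tolerate spaces around commas and ignore empty fragments.
    ids = [part.strip() for part in cell.split(",") if part.strip()]
    unknown = [cid for cid in ids if cid not in known_ids]
    if unknown:
        raise ValueError(f"unknown config_id(s): {', '.join(unknown)}")
    return ids
```

Validating up front means a typo in the column is reported when the table is loaded, rather than surfacing later as a missing dropdown entry.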

I guess at this point, it just depends on how many configurations one might expect per data set. Too many and the table column might get unwieldy and painful to maintain. That said, I haven't heard feedback from anyone designing more than a handful of configurations per data set, so I guess we can start with option 2, and revisit in the future if needed.

Thoughts?

Fixed by #37