aramis-lab/clinica

Add version and possibly other metadata to CAPS datasets


The BIDS specifications define a dataset_description.json file at the root of the BIDS dataset, which holds dataset-level metadata such as the name of the dataset and the version of the BIDS specifications that the dataset is supposed to follow.

Currently, we do not have anything similar for CAPS datasets. Since the specs do not necessarily evolve at the same pace as the clinica software, it makes sense to version the two separately.

Recent changes in the CAPS specifications (such as the names of PET tracers or the deduplication of PET suffixes), as well as foreseen changes, call for such metadata.

Having a version number for the CAPS specifications would be a relatively easy addition, and it could be useful later on, for example when attempting to update old datasets to a newer version of the specifications.

A basic implementation could look like this:

from dataclasses import dataclass

from clinica.utils.input import DatasetType
from clinica.utils.bids import BIDS_VERSION

# Version of the CAPS specifications, tracked independently of the clinica version.
CAPS_VERSION = "1.0.0"


@dataclass
class CAPSDatasetDescription:
    """Model representing a CAPS dataset description."""

    name: str
    bids_version: str = BIDS_VERSION
    caps_version: str = CAPS_VERSION
    dataset_type: DatasetType = DatasetType.DERIVATIVE

In addition, it'd be great to support the GeneratedBy and DatasetLinks keys from the BIDS specifications.

Maybe something like this for the CAPS output of T1Linear:

{
  "Name": "T1-Linear",
  "BIDSVersion": "1.7.0",
  "CAPSVersion": "1.0.0",
  "DatasetType": "derivative",
  "GeneratedBy": [
    {
      "Name": "clinica",
      "Version": "0.9.0",
      "CodeURL": "https://github.com/aramis-lab/clinica"
    }
  ],
  "SourceDatasets": [
    {
      "URL": "../bids"
    }
  ],
  "DatasetLinks": {
    "raw": "../bids/"
  }
}
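
As a rough sketch of how such a file could be written, assuming a plain dataclass rather than whatever model clinica ends up using: `CAPSDescription`, `to_bids_dict` and `write_description` are hypothetical names, and the version constants are simply copied from the example above instead of being imported from clinica.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path

BIDS_VERSION = "1.7.0"  # value copied from the JSON example, not read from clinica
CAPS_VERSION = "1.0.0"  # proposed initial version of the CAPS specifications


@dataclass
class CAPSDescription:
    """Hypothetical stand-in for the proposed CAPSDatasetDescription."""

    name: str
    bids_version: str = BIDS_VERSION
    caps_version: str = CAPS_VERSION
    dataset_type: str = "derivative"
    generated_by: list = field(default_factory=list)
    source_datasets: list = field(default_factory=list)
    dataset_links: dict = field(default_factory=dict)

    def to_bids_dict(self) -> dict:
        """Map the Python attributes to the CamelCase keys used in the JSON file."""
        return {
            "Name": self.name,
            "BIDSVersion": self.bids_version,
            "CAPSVersion": self.caps_version,
            "DatasetType": self.dataset_type,
            "GeneratedBy": self.generated_by,
            "SourceDatasets": self.source_datasets,
            "DatasetLinks": self.dataset_links,
        }


def write_description(description: CAPSDescription, caps_dir: Path) -> Path:
    """Write dataset_description.json at the root of the CAPS folder."""
    caps_dir.mkdir(parents=True, exist_ok=True)
    target = caps_dir / "dataset_description.json"
    target.write_text(json.dumps(description.to_bids_dict(), indent=2))
    return target
```

Keeping the attribute-to-key mapping in one method means the Python model can follow snake_case while the file on disk follows the BIDS key naming.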

On further thought, there are some additional subtleties to consider:

  • Unlike BIDS datasets, CAPS datasets can be the input and/or output of pipelines.
  • CAPS datasets can "grow": the results of new pipelines get merged into an existing dataset. In theory this can happen at any time, even months or years later.

With this in mind, there are some design choices to make. For example, what should the name of a CAPS dataset be?
I was aiming at using the pipeline name, but since there can be multiple pipeline outputs for a single CAPS dataset, it's not so straightforward.
Maybe "pipeline_1 + pipeline_2 + ..."?
Generate a random identifier?
Maybe the user should be able to specify the name (but in this case we'd need to change the CLI to allow name specification)?
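
The three naming options above are not mutually exclusive. A minimal sketch of how they could be combined, with a user-provided name taking priority, follows. All function names are hypothetical and the `caps-` prefix is an arbitrary choice.

```python
import uuid
from typing import Optional, Sequence


def name_from_pipelines(pipelines: Sequence[str]) -> str:
    """Join the distinct pipeline names, keeping the order in which they first ran."""
    return " + ".join(dict.fromkeys(pipelines))


def random_name() -> str:
    """Generate a short random identifier for a dataset without any other name."""
    return f"caps-{uuid.uuid4().hex[:8]}"


def resolve_name(user_name: Optional[str], pipelines: Sequence[str]) -> str:
    """Prefer the user's choice, then the pipeline names, then a random identifier."""
    if user_name:
        return user_name
    if pipelines:
        return name_from_pipelines(pipelines)
    return random_name()
```

One drawback of deriving the name from the pipelines is that it changes every time the dataset grows, which a user-provided or random name would avoid.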

Also, I believe it is entirely possible for a single CAPS dataset to contain the outputs of pipelines that used different BIDS / CAPS datasets as inputs.
We need to be able to handle those cases and trace the different source datasets that were used as inputs. I believe this is the responsibility of the DatasetLinks key, but this needs to be verified.
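
Assuming DatasetLinks is indeed the right key for this (still to be checked against the BIDS specifications), tracking several inputs could amount to registering each source under its own alias. The helper below is a hypothetical sketch, and the `raw`, `raw2`, ... alias scheme is only an illustration.

```python
def register_source(links: dict, location: str, alias: str = "raw") -> str:
    """Record `location` in `links` and return the alias under which it is stored.

    A location that is already registered keeps its existing alias, so re-running
    a pipeline on the same input does not create duplicate entries.
    """
    for existing_alias, existing_location in links.items():
        if existing_location == location:
            return existing_alias
    # Find a free alias: raw, raw2, raw3, ...
    candidate, counter = alias, 1
    while candidate in links:
        counter += 1
        candidate = f"{alias}{counter}"
    links[candidate] = location
    return candidate
```

For example, registering "../bids/" twice and then "../other_bids/" once would leave two entries in the mapping, one per distinct source dataset.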

Then, there could be version mismatches. For example, users could re-run a pipeline with a version of the CAPS specifications that differs from the one of the existing CAPS dataset provided as output.
Should we simply raise an error? Try to convert the CAPS dataset to the newer version of the specifications?
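
Whatever policy is chosen, the first step is detecting the mismatch. Here is a minimal sketch, assuming CAPS versions are plain "X.Y.Z" strings; `VersionStatus` and `compare_caps_versions` are hypothetical names. Versions are compared as integer tuples because a string comparison would rank "1.10.0" below "1.9.0".

```python
from enum import Enum


class VersionStatus(Enum):
    SAME = "same"
    DATASET_OLDER = "dataset_older"
    DATASET_NEWER = "dataset_newer"


def parse_version(version: str) -> tuple:
    """Turn "1.10.0" into (1, 10, 0) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def compare_caps_versions(dataset_version: str, current_version: str) -> VersionStatus:
    """Classify the version found in an existing CAPS dataset against the current one."""
    dataset = parse_version(dataset_version)
    current = parse_version(current_version)
    if dataset == current:
        return VersionStatus.SAME
    if dataset < current:
        return VersionStatus.DATASET_OLDER
    return VersionStatus.DATASET_NEWER
```

A caller could then raise an error on DATASET_NEWER and either raise or attempt a conversion on DATASET_OLDER; which of the two is exactly the open question.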

Finally, I believe the "GeneratedBy" key should be a little more verbose than just saying "clinica version X.Y.Z". Having the pipeline names along with the version of clinica with which they were run would be a good start.
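
One possible shape for this, sketched under the assumption that each pipeline run gets its own GeneratedBy entry, with "Name" holding the pipeline name and "Version" the clinica version. That mapping is an assumption, not something settled here, and `record_pipeline_run` is a hypothetical helper.

```python
def record_pipeline_run(generated_by: list, pipeline: str, clinica_version: str) -> list:
    """Append one GeneratedBy entry per (pipeline, clinica version) pair, without duplicates."""
    entry = {
        "Name": pipeline,
        "Version": clinica_version,
        "Description": f"Outputs of the {pipeline} pipeline run with clinica {clinica_version}",
    }
    if entry not in generated_by:
        generated_by.append(entry)
    return generated_by
```

With this layout, a dataset that grew over time would show, for instance, a t1-linear entry for clinica 0.9.0 next to a later entry for the same pipeline run with a newer clinica release.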