mih/datalad-mihextras

Support RO-crate metadata

Closed this issue · 10 comments

mih commented

A good step towards #42 and RFD0041 in general would be to have a helper that can write out an RO-crate record for a DataLad dataset (https://w3id.org/ro/crate/1.1).

This could also be written to a DataLad dataset and make it an RO-crate itself.

I'm starting to explore this locally, will post updates here.

mih commented

I had to put it on hold, but this is likely going to happen via linkml.

Thanks for the update, I'll bring linkml into my explorations. I started with its tutorial last week.

A dump of my explorations up until now:

The current idea is to work with the rocrate Python package (https://github.com/ResearchObject/ro-crate-py) to quickly put an RO-crate record together, but there is no strict requirement for this as long as the resulting record complies with the https://w3id.org/ro/crate/1.1 standard. The core requirements are:

  • the record results in a JSON file named ro-crate-metadata.json with JSON-LD structured content, for example:
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {
      "@type": "CreativeWork",
      "@id": "ro-crate-metadata.json",
      "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
      "about": {"@id": "./"}
    },  
    {
      "@id": "./",
      "@type": [
        "Dataset"
      ],
      "hasPart": [
        {
          "@id": "myfile.pdf"
        },
        ...
      ]
    },
    {
      "@id": "myfile.pdf",
      "@type": "File",
      "name": "Diagram of stuff",
      "contentSize": "383766",
      "contentUrl": "...",
      "description": "pdf file with some stuff",
      "encodingFormat": "application/pdf"
    },
    ...
  ]
}
  • the root dataset being described ("@id": "./") will have multiple hasPart elements describing the files and directories in the dataset, each also being described by its own so-called Data Entity element in the graph array.
  • the dataset being described can also have so-called Contextual Entities, each an element in the graph array. These would be additional descriptors of properties of the dataset, such as authors, name, etc.

First steps to put this together (a rough code sketch follows this list):

  1. Create an empty RO-crate.
  2. Iterate over the result of datalad.api.status(path='.') in order to add Data Entities:
    1. Ignore dot-files/-directories
    2. Create an element in hasPart and in the graph for each file (and each directory in the file path if it doesn't exist yet)
    3. Fill in properties from the result of the status record
    4. Fill in contentUrl (from where? git annex whereis?)
  3. Add Contextual Entities:
    • git contributors as authors?
    • (note to self: check whether there are REQUIRED contextual entities)
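
A minimal sketch of what steps 1 and 2 could look like, assuming the rocrate package and datalad's Python API; the status record keys used here ('path', 'type', 'bytesize') and the crate.write() call should be verified against the installed versions:

from pathlib import Path

import datalad.api as dl
from rocrate.rocrate import ROCrate

ds_path = Path('.').resolve()
crate = ROCrate()  # step 1: an empty crate

for res in dl.status(path=str(ds_path), annex='basic',
                     result_renderer='disabled', return_type='list'):
    rel = Path(res['path']).relative_to(ds_path)
    # step 2.1: ignore dot-files/-directories
    if any(part.startswith('.') for part in rel.parts):
        continue
    if res.get('type') == 'file':
        props = {}
        # step 2.3: fill in properties from the status record
        if res.get('bytesize') is not None:
            props['contentSize'] = str(res['bytesize'])
        # add_file creates the Data Entity and the root's hasPart reference
        crate.add_file(res['path'], dest_path=str(rel), properties=props)

# write ro-crate-metadata.json (plus payload copies) to a target directory
crate.write('rocrate-export')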

Some open questions:

  • How to handle container images (look specifically in .datalad/environments?)
  • How to handle subdatasets? Should they first all be installed recursively before the status call, or handled separately with a datalad subdatasets or a foreach_dataset call? (See the snippet after this list.)
  • Will we standardise a source for more contextual entities? E.g. tabby?
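
For the subdataset question, either option needs to enumerate the subdatasets first; a minimal sketch with datalad.api.subdatasets (the gitmodule_url result key is an assumption worth verifying):

import datalad.api as dl

# enumerate all registered subdatasets, installed or not
for sub in dl.subdatasets(dataset='.', recursive=True,
                          result_renderer='disabled', return_type='list'):
    # URLs of not-yet-installed subdatasets come from .gitmodules
    print(sub['path'], sub.get('gitmodule_url'))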

After going through the LinkML tutorial and documentation, this is how I view its utility in the context of the process above:

  1. We can define our data model / schema in a YAML file using the LinkML approach.
    1. At first we will start with a model/schema reflecting the structure of an RO Crate
    2. This model will be expanded to include whichever contextual entities we feel are necessary to support our specific use cases (specifically, describing a datalad dataset, but perhaps more broadly: representing a generic dataset in a catalog)
  2. From the model/schema, we can generate Python dataclasses and use them together with the LinkML runtime to create actual data (JSON documents). This would be an alternative to, or in fact a more broadly applicable implementation than, using the rocrate library mentioned above (see the sketch after this list).
  3. We wouldn't necessarily need it for this particular use case (generating an RO-crate from a datalad dataset), but we can also use the YAML specification of the model/schema to validate the JSON documents against.
  4. We can generate documentation from the schema.
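
A minimal sketch of points 2 and 3, with all schema and class names hypothetical: the schema is first compiled to Python dataclasses with LinkML's gen-python generator, and the resulting instances are serialized with the LinkML runtime (validation against the same YAML schema can then be done with the linkml-validate CLI):

# all names below are hypothetical; compile the schema first with e.g.:
#   gen-python datalad-dataset.yaml > datalad_dataset.py
from linkml_runtime.dumpers import json_dumper

from datalad_dataset import Dataset, File  # hypothetical generated classes

ds = Dataset(
    id='./',
    name='midterm_project',
    hasPart=[File(id='code/script.py', contentSize='1604')],
)
# serialize the dataclass instance into a JSON document
print(json_dumper.dumps(ds))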

Apart from this list, am I missing any other immediate benefits of using LinkML in this context?

mih commented

The difference is the approach. Going from datalad to ro-crate would further proliferate the direct use of a complex datalad/git/git-annex API combination for producing one specific type of metadata.

Producing ro-crate from a datalad dataset model would confine that complexity to linkml, and also make it reusable for any other metadata type.

Are you implying that we should aim not to use the datalad stack at all in the process of generating such a metadata record? E.g. use Python to traverse the dataset file tree? If so, I am uncertain whether there are alternative ways of getting information about the files, especially availability information stored by git annex.
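
One low-level alternative would be to shell out to git-annex itself; a rough sketch (the JSON field names should be double-checked against the installed git-annex version):

import json
import subprocess

def annex_whereis(dataset_path, relpath):
    """List the locations git-annex knows for one file."""
    out = subprocess.run(
        ['git', 'annex', 'whereis', '--json', relpath],
        cwd=dataset_path, capture_output=True, text=True, check=True,
    )
    record = json.loads(out.stdout.splitlines()[0])
    # each entry describes one remote; any 'urls' it carries could
    # feed an RO-crate contentUrl property
    return record.get('whereis', [])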

I'm first getting a better understanding of the ro-crate standard, and in order to do so I created a datalad-based script to create an ro-crate metadata record from a dataset: https://github.com/jsheunis/datalad-sexy-snippets/blob/main/tools/metadata/make_rocrate.py

When running this script on a local clone of the midterm project dataset (https://github.com/datalad-handbook/midterm_project), I get the following output:

> python tools/metadata/make_rocrate.py ../Data/rocrate-test/midterm_project

{
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@type": "CreativeWork",
            "@id": "ro-crate-metadata.json",
            "conformsTo": {
                "@id": "https://w3id.org/ro/crate/1.1"
            },
            "about": {
                "@id": "./"
            }
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "identifier": "",
            "datePublished": "",
            "name": "",
            "description": "",
            "license": {
                "@id": ""
            },
            "hasPart": [
                {
                    "@id": "CHANGELOG.md"
                },
                {
                    "@id": "README.md"
                },
                {
                    "@id": "code/"
                },
                {
                    "@id": "code/README.md"
                },
                {
                    "@id": "code/script.py"
                },
                {
                    "@id": "pairwise_relationships.png"
                },
                {
                    "@id": "prediction_report.csv"
                }
            ]
        },
        {
            "@id": "CHANGELOG.md",
            "@type": "File"
        },
        {
            "@id": "README.md",
            "@type": "File"
        },
        {
            "@id": "code/",
            "@type": "Dataset"
        },
        {
            "@id": "code/README.md",
            "@type": "File",
            "contentSize": 184
        },
        {
            "@id": "code/script.py",
            "@type": "File",
            "encodingFormat": "text/x-python",
            "contentSize": 1604
        },
        {
            "@id": "pairwise_relationships.png",
            "@type": "File"
        },
        {
            "@id": "prediction_report.csv",
            "@type": "File"
        }
    ]
}

You can see there are many properties without values (name, description, etc.), which could be populated in various ways, depending on how we define our data model and whether we specify a standard source for information not contained in a datalad dataset. There are also many absent contextual entities (e.g. authors) that could be readily populated from information already in a datalad dataset (config, submodules, etc.); this, too, depends on how we define our data model.
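
As one example for the author entities, a rough sketch that derives candidate authors from the commit history via git shortlog (whether commit authorship is the right notion of "author" here is exactly a data-model question):

import subprocess

def git_contributors(dataset_path):
    """Collect candidate author entities from the commit history."""
    out = subprocess.run(
        ['git', 'shortlog', '-sne', 'HEAD'],
        cwd=dataset_path, capture_output=True, text=True, check=True,
    )
    authors = []
    for line in out.stdout.splitlines():
        # lines look like: "   42\tJane Doe <jane@example.com>"
        _count, ident = line.strip().split('\t', 1)
        name, _, email = ident.rpartition(' <')
        authors.append({'name': name, 'email': email.rstrip('>')})
    return authors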

My next step is to work on these absent entities and their sources, and in parallel to build the data model in the YAML file using the LinkML approach.

Currently working on building a LinkML model for a datalad dataset, and trying to figure out whether it is better to (1) first create a LinkML model for an RO-crate and then import that into a separate LinkML model for a datalad dataset, i.e. making the latter a subset of the former; or (2) create the LinkML model for the datalad dataset and only bring in some properties (i.e. slots) from an RO-crate, such that it is compatible as a by-product.

Unrelated: a starting point for slot definitions of the dataset model could be our tabby conventions, e.g.: https://docs.datalad.org/projects/tabby/en/latest/conventions/tby-ds1.html

FTR: I've started with a dataset data model based on our tabby convention: jsheunis/datalad-sexy-snippets@da74a58

At the moment it's just a linkml version of the same tby-ds1 convention, with minor tweaks to the inclusion/use of some ontologies, and it still needs the author(s) slot + class.

I haven't checked whether data validates against this yet. This is just to give an idea of what I'm working on and to invite any form of input.

mih commented

I am closing this in favor of psychoinformatics-de/datalad-concepts#61