psychoinformatics-de/datalad-concepts

Support RO-crate specification

Opened this issue · 4 comments

Read the following for context:

The question for me is what role the RO-crate specification should play in defining the dataset schema in LinkML. From the issue that's linked above:

Currently working on building a LinkML model for a datalad dataset, and trying to figure out if it's better to
(1) first create a linkml model for an RO-crate and then import that into a separate linkml model for a datalad dataset, i.e. making the latter a subset of the former; or
(2) create the linkml model for the datalad dataset and only bring in some properties (i.e. slots) from an ro-crate such that it is compatible as a by-product.

To add to (2): one could build a model that is completely separate from RO-crate and purely follows what we see as the ideal for a datalad dataset metadata structure, and then bring in compatibility with RO-crate as a separate tool, i.e. exporting to the RO-crate specification would be one of many supported "translation" options.

@mih @mslw @christian-monch curious to hear your thoughts on this

mih commented

Exactly. An ro-crate should be an export of a datalad data model for a single version of a dataset.

mih commented

Starting with an effort to model an RO-crate with linkml. It seems the first step would be to decide on a good input representation.

Initially, I thought it would be good to take an RO-crate and frame it with something like

    {
      "@context": "https://w3id.org/ro/crate/1.1/context",
      "@type": "http://schema.org/Dataset"
    }

to get a hierarchical representation. However, this ruins the deduplicating nature of an RO-crate (a flat array of elementary object definitions, i.e. an author appears only once in a record). Moreover, LinkML IO tooling will strip anything that starts with @, including @id, which is essential in an RO-crate because it represents the "filename/location" in a dataset.
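
To make the deduplication point concrete, here is a minimal sketch in plain Python (no JSON-LD library involved). It imitates what embedding references does to a flat graph; it is not the actual JSON-LD framing algorithm. The `embed` and `count_persons` helpers are hypothetical, and the toy graph assumes, unlike the example further down, that both files share the same author.

```python
# Flat, RO-crate-like graph: every entity is defined once and referenced by "@id".
graph = [
    {"@id": "./", "@type": "Dataset",
     "hasPart": [{"@id": "data1.txt"}, {"@id": "data2.txt"}]},
    {"@id": "data1.txt", "@type": "MediaObject", "author": {"@id": "#alice"}},
    {"@id": "data2.txt", "@type": "MediaObject", "author": {"@id": "#alice"}},
    {"@id": "#alice", "@type": "Person", "name": "Alice"},
]
index = {entity["@id"]: entity for entity in graph}


def embed(value, seen=()):
    """Replace pure {"@id": ...} references with the referenced entity (hypothetical helper)."""
    if isinstance(value, list):
        return [embed(item, seen) for item in value]
    if isinstance(value, dict):
        ref = value.get("@id")
        # A pure reference has "@id" as its only key; `seen` guards against cycles.
        if len(value) == 1 and ref in index and ref not in seen:
            return embed(index[ref], seen + (ref,))
        return {key: embed(item, seen) for key, item in value.items()}
    return value


def count_persons(value):
    """Count how many Person definitions occur in a (possibly nested) structure."""
    if isinstance(value, list):
        return sum(count_persons(item) for item in value)
    if isinstance(value, dict):
        own = 1 if value.get("@type") == "Person" else 0
        return own + sum(count_persons(item) for item in value.values())
    return 0


nested = embed(index["./"], ("./",))
flat_count = count_persons(graph)     # Alice is defined once in the flat graph
nested_count = count_persons(nested)  # and once per referencing file after embedding
print(flat_count, nested_count)
```

In the nested form the person record is repeated under every file that points to it, which is exactly the duplication the flat layout avoids.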

Maybe it would be better to use something like this:

{
  "@context": "http://schema.org/",
  "@graph": [
    {
      "id": "ro-crate-metadata.json",
      "type": "CreativeWork",
      "dct:conformsTo": {
        "id": "https://w3id.org/ro/crate/1.1"
      },
      "about": {
        "id": "./"
      },
      "description": "RO-Crate Metadata File Descriptor (this file)"
    },
    {
      "id": "./",
      "type": "Dataset",
      "description": "The RO-Crate Root Data Entity",
      "hasPart": [
        {
          "id": "data1.txt"
        },
        {
          "id": "data2.txt"
        }
      ],
      "name": "Example RO-Crate"
    },
    {
      "id": "data1.txt",
      "type": "MediaObject",
      "author": {
        "id": "#alice"
      },
      "contentLocation": {
        "id": "http://sws.geonames.org/8152662/"
      },
      "description": "One of hopefully many Data Entities"
    },
    {
      "id": "data2.txt",
      "type": "MediaObject"
    },
    {
      "id": "#alice",
      "type": "Person",
      "description": "One of hopefully many Contextual Entities",
      "name": "Alice"
    },
    {
      "id": "http://sws.geonames.org/8152662/",
      "type": "Place",
      "name": "Catalina Park"
    }
  ]
}

This is a plain RO-crate passed through JSON-LD compaction with the context:

{
  "@context": "http://schema.org/"
}
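
As a rough illustration of why this compacted form is convenient, the sketch below treats the `@graph` array as a plain list of records keyed by `id`. It uses only the standard library and a shortened copy of the example above; the `follow` helper is a hypothetical name, not part of any LinkML or RO-crate tooling.

```python
# Shortened copy of the compacted example: keys are "id"/"type", not "@id"/"@type".
doc = {
    "@context": "http://schema.org/",
    "@graph": [
        {"id": "./", "type": "Dataset", "name": "Example RO-Crate",
         "hasPart": [{"id": "data1.txt"}, {"id": "data2.txt"}]},
        {"id": "data1.txt", "type": "MediaObject", "author": {"id": "#alice"}},
        {"id": "data2.txt", "type": "MediaObject"},
        {"id": "#alice", "type": "Person", "name": "Alice"},
    ],
}

# Each entity is defined exactly once, so a dict keyed by "id" is a lossless index.
entities = {entity["id"]: entity for entity in doc["@graph"]}


def follow(entity_id, prop):
    """Return the entities that `prop` of `entity_id` points to (hypothetical helper)."""
    value = entities[entity_id].get(prop, [])
    refs = value if isinstance(value, list) else [value]
    # References to entities that are not described in the graph are returned as-is.
    return [entities.get(ref["id"], ref) for ref in refs]


part_ids = [part["id"] for part in follow("./", "hasPart")]
author_names = [author["name"] for author in follow("data1.txt", "author")]
print(part_ids, author_names)
```

Because references are resolved by lookup rather than by nesting, the person record stays defined in one place.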

We could now process @graph only. However, with a complex RO-crate this may not work, because mixing context sources yields something like this:

{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {
      "@id": "ro-crate-metadata.json",
      "@type": "CreativeWork",
      "conformsTo": {
        "@id": "https://w3id.org/ro/crate/1.1"
      },
      "about": {
        "@id": "./"
      },
      "description": "RO-Crate Metadata File Descriptor (this file)"
    },
...

with the context:

{
  "@context": "https://w3id.org/ro/crate/1.1/context"
}

Maybe we need a custom pre-processor...
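
One possible shape for such a pre-processor, as a sketch only: rename the `@`-prefixed keywords to plain keys before the records reach any tooling that would strip them. `KEY_MAP` and `preprocess` are hypothetical names, and this does not touch term naming (e.g. `conformsTo` vs. `dct:conformsTo`), which would need separate handling.

```python
import json

# Assumed mapping; extend it if other JSON-LD keywords need to survive.
KEY_MAP = {"@id": "id", "@type": "type"}


def preprocess(node):
    """Recursively rename JSON-LD keywords so that they are ordinary keys."""
    if isinstance(node, list):
        return [preprocess(item) for item in node]
    if isinstance(node, dict):
        return {KEY_MAP.get(key, key): preprocess(value) for key, value in node.items()}
    return node


# First entity of the RO-crate-context example above.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
            "description": "RO-Crate Metadata File Descriptor (this file)",
        }
    ],
}

# Only the flat list of entities is kept; the context is dropped on purpose.
records = preprocess(crate["@graph"])
print(json.dumps(records[0], indent=2))
```

The output is a flat list of records with `id` and `type` keys, independent of which context the crate was compacted with.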

mih commented

#94 demos a first few RO-crate specification features in our setup.