octue/twined

Comments Welcome: Normalise manifest data structure


Currently, a manifest data structure looks like this:

{
  "id": "8ead7669-8162-4f64-8cd5-4abe92509e17",
  "type": "input",
  "datasets": [
    {
      "id": "7ead7669-8162-4f64-8cd5-4abe92509e17",
      "tags": "met, mast, wind",
      "files": [
        {
          "path": "input/datasets/7ead7669/file_1.csv",
          "cluster": 0,
          "sequence": 0,
          "extension": "csv",
          "metadata": {},
          "tags": "",
          "posix_timestamp": null,
          "data_file": {
            "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86",
            "last_modified": "2019-02-28T22:40:30.533005Z",
            "name": "file_1.csv",
            "size_bytes": 59684813,
            "sha-512/256": "somesha"
          }
        },
        {
          "path": "input/datasets/7ead7669/file_2.csv",
          "cluster": 0,
          "sequence": 1,
          "extension": "csv",
          "metadata": {},
          "tags": "",
          "posix_timestamp": null,
          "data_file": {
            "id": "bbff07bc-7c19-4ed5-be6d-a6546eae8e45",
            "last_modified": "2019-02-28T22:40:40.633001Z",
            "name": "file_2.csv",
            "size_bytes": 59684813,
            "sha-512/256": "someothersha"
          }
        }
      ]
    }
  ]
}

However, this is a nested data structure: each manifest file record wraps an individual data file (the "data_file" object) inside its dataset.

It may well be better to normalise the manifest, and have something like:

{
  ...manifest_data,
  datasets: [
    {
      ...dataset_data,
      files: [
        ...list of ids
      ]
    },
    ...
  ],
  files: [
    ...list of data_file records
  ],
  clusters: {...stuff...},
  sequences: {...stuff...}
}
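As a starting point for discussion, here is a minimal Python sketch of how the nested form could be flattened into the normalised one. It is not part of twined; the function name `normalise_manifest` is hypothetical, and two details are assumptions because the outline above leaves them open: `clusters` and `sequences` are taken to map each data file id to its cluster and sequence number, and the remaining wrapper fields (path, extension, tags and so on) are merged into the top-level file record so nothing is lost.

```python
import copy


def normalise_manifest(manifest):
    """Flatten a nested manifest dict into id-referenced collections.

    Assumptions (not defined in the issue): `clusters` and `sequences` map
    data_file id -> integer, and wrapper fields such as `path` are merged
    into each top-level file record.
    """
    # Copy every top-level manifest field except the nested datasets.
    normalised = {
        key: copy.deepcopy(value)
        for key, value in manifest.items()
        if key != "datasets"
    }
    normalised.update({"datasets": [], "files": [], "clusters": {}, "sequences": {}})

    for dataset in manifest.get("datasets", []):
        # Keep the dataset's own fields, but replace its files with a list of ids.
        flat_dataset = {
            key: copy.deepcopy(value)
            for key, value in dataset.items()
            if key != "files"
        }
        flat_dataset["files"] = []

        for record in dataset.get("files", []):
            data_file = copy.deepcopy(record["data_file"])
            file_id = data_file["id"]

            # Wrapper fields other than cluster/sequence travel with the file.
            wrapper = {
                key: copy.deepcopy(value)
                for key, value in record.items()
                if key not in ("data_file", "cluster", "sequence")
            }

            flat_dataset["files"].append(file_id)
            normalised["files"].append({**wrapper, **data_file})
            normalised["clusters"][file_id] = record["cluster"]
            normalised["sequences"][file_id] = record["sequence"]

        normalised["datasets"].append(flat_dataset)

    return normalised
```

Applied to the manifest shown at the top of this issue, each dataset's `files` would become a list of data file ids, with the file records themselves collected once at the top level. The input dict is left unmodified.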

Please let loose with discussion and use cases!

The 0.1.4 release of octue-sdk-python brings a Pathable mixin, which is becoming quite powerful for relating manifests -> datasets -> datafiles (at first just for managing paths, but I envisage it developing into a more general relationship between them). With this kind of tree structure, we need to be less concerned about the normalised data structure suggested here.

Also, since I opened this issue there has been no pressing need to address it, so I'll close it now. Happy to reopen if anyone needs it.