monarch-initiative/koza

required_properties for nested JSON files tags

Closed this issue · 3 comments

Some Koza configuration YAML files, such as those that parse Alliance schema-compliant input data, need a way to specify nested JSON tags as required properties, e.g.:

required_properties:
  - metaData.crossReference.id
  - geneId
  - whenExpressed.stageTermId
  - assay
  - whereExpressed.cellularComponentTermId
  - whereExpressed.anatomicalStructureTermId
  - crossReference.id
  - evidence.publicationId

given a typical Alliance schema-driven gene expression site data file entry as follows:

{
    "dateAssigned": "2022-01-21T07:09:02-08:00",
    "geneId": "ZFIN:ZDB-GENE-031222-3",
    "evidence": {
        "crossReference": {"id": "ZFIN:ZDB-PUB-080616-21", "pages": ["reference"]},
        "publicationId": "PMID:18544660"
    },
    "crossReference": {
        "id": "ZFIN:ZDB-FIG-080908-4",
        "pages": ["gene/expression/annotation/detail"]
    },
    "assay": "MMO:0000655",
    "whenExpressed": {
        "stageName": "Larval:Protruding-mouth",
        "stageTermId": "ZFS:0000035",
        "stageUberonSlimTerm": {"uberonTerm": "post embryonic, pre-adult"}
    },
    "whereExpressed": {
        "whereExpressedStatement": "whole organism",
        "anatomicalStructureTermId": "ZFA:0001094",
        "anatomicalStructureUberonSlimTermIds": [{"uberonTerm": "Other"}]
    }
}

Unfortunately, such a specification does not yet work, giving the following error:

ValueError: Required properties defined for alliance_gene_to_expression are missing from data\alliance\EXPRESSION_SGD.json.gz
Missing properties: {'evidence.publicationId', 'whenExpressed.stageTermId', 'crossReference.id', 'whereExpressed.anatomicalStructureTermId', 'whereExpressed.cellularComponentTermId', 'metaData.crossReference.id'}
Row: ...

Or perhaps (as a second exemplar):

required_properties:
  - 'objectId'
  - 'phenotypeTermIdentifiers'
  - 'evidence'
  - "phenotypeTermIdentifiers[0]['termId']"
  - 'evidence.publicationId'

for an Alliance phenotype record with an embedded array, something like:

{
    "dateAssigned": "2006-10-25T18:06:17.000-05:00",
    "evidence": {
        "crossReference": {"id": "RGD:1357201", "pages": ["reference"]},
        "publicationId": "PMID:11549339"
    },
    "objectId": "RGD:61958",
    "phenotypeStatement": "cardiac hypertrophy",
    "phenotypeTermIdentifiers": [{"termId": "MP:0001625", "termOrder": 1}]
}

Note that phenotypeTermIdentifiers designates an array of values to be parsed.

Hi @RichardBruskiewich - I'm going to start looking into this and may have an idea of how to implement it.

Would you happen to know of any particular examples of transforms that would require this feature?

Hi @glass-ships, as indicated above, it was the Alliance ingests that motivated this issue. In fact, I coded an internal workaround for this inside the ingest (see the ingest code, which calls the get_data() method, which takes dot-delimited tag paths as inputs).

Maybe the code in question can guide your Koza solution?

Awesome, thanks @RichardBruskiewich !! I'll take a look