required_properties for nested JSON file tags
Some Koza configuration YAML files, such as those that parse Alliance schema-compliant input data, need a way to specify nested JSON tags as required properties, e.g.
required_properties:
- metaData.crossReference.id
- geneId
- whenExpressed.stageTermId
- assay
- whereExpressed.cellularComponentTermId
- whereExpressed.anatomicalStructureTermId
- crossReference.id
- evidence.publicationId
given a typical Alliance schema-driven gene expression site data file entry as follows:
{
  "dateAssigned": "2022-01-21T07:09:02-08:00",
  "geneId": "ZFIN:ZDB-GENE-031222-3",
  "evidence": {
    "crossReference": {"id": "ZFIN:ZDB-PUB-080616-21", "pages": ["reference"]},
    "publicationId": "PMID:18544660"
  },
  "crossReference": {
    "id": "ZFIN:ZDB-FIG-080908-4",
    "pages": ["gene/expression/annotation/detail"]
  },
  "assay": "MMO:0000655",
  "whenExpressed": {
    "stageName": "Larval:Protruding-mouth",
    "stageTermId": "ZFS:0000035",
    "stageUberonSlimTerm": {"uberonTerm": "post embryonic, pre-adult"}
  },
  "whereExpressed": {
    "whereExpressedStatement": "whole organism",
    "anatomicalStructureTermId": "ZFA:0001094",
    "anatomicalStructureUberonSlimTermIds": [{"uberonTerm": "Other"}]
  }
}
Unfortunately, such a specification does not yet work, giving the following error:
ValueError: Required properties defined for alliance_gene_to_expression are missing from data\alliance\EXPRESSION_SGD.json.gz
Missing properties: {'evidence.publicationId', 'whenExpressed.stageTermId', 'crossReference.id', 'whereExpressed.anatomicalStructureTermId', 'whereExpressed.cellularComponentTermId', 'metaData.crossReference.id'}
Row: ...
Or perhaps (as a second exemplar):
required_properties:
- 'objectId'
- 'phenotypeTermIdentifiers'
- 'evidence'
- "phenotypeTermIdentifiers[0]['termId']"
- 'evidence.publicationId'
for an Alliance phenotype record with an embedded array, something like
{
  "dateAssigned": "2006-10-25T18:06:17.000-05:00",
  "evidence": {
    "crossReference": {"id": "RGD:1357201", "pages": ["reference"]},
    "publicationId": "PMID:11549339"
  },
  "objectId": "RGD:61958",
  "phenotypeStatement": "cardiac hypertrophy",
  "phenotypeTermIdentifiers": [{"termId": "MP:0001625", "termOrder": 1}]
}
Note that phenotypeTermIdentifiers designates an array of values to be parsed.
Hi @RichardBruskiewich - I'm going to start taking a look at this and may have an idea how to implement.
Would you happen to know of any particular examples of transforms that would require this feature?
Hi @glass-ships, as indicated above, it's the Alliance ingests that motivated this issue. In fact, I coded an internal workaround for this inside the ingest (see the ingest code that calls the get_data() method, which takes dot-delimited tag paths as inputs).
Maybe the code in question can guide your Koza solution?
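For readers without the ingest code at hand, a dot-delimited lookup of this kind might look roughly like the sketch below. This is not the actual get_data() implementation: get_nested is a hypothetical stand-in, and it assumes every intermediate value is a dictionary (no array handling).

```python
def get_nested(entry, tag_path, default=None):
    """Walk a dot-delimited path such as 'evidence.publicationId'.

    Hypothetical stand-in for illustration; not the ingest's get_data().
    """
    current = entry
    for tag in tag_path.split("."):
        # Stop as soon as a level is not a dict or lacks the tag.
        if not isinstance(current, dict) or tag not in current:
            return default
        current = current[tag]
    return current


entry = {
    "geneId": "ZFIN:ZDB-GENE-031222-3",
    "evidence": {"publicationId": "PMID:18544660"},
    "whenExpressed": {"stageTermId": "ZFS:0000035"},
}
stage = get_nested(entry, "whenExpressed.stageTermId")
component = get_nested(entry, "whereExpressed.cellularComponentTermId")
```

Here stage holds the nested stage term, while component falls back to the default because the entry has no whereExpressed block.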
Awesome, thanks @RichardBruskiewich !! I'll take a look