owlcs/owlapi

[OBO parser] Axiom annotations using arbitrary properties are not always properly translated

Closed this issue · 3 comments

Since PR #1099 (present in OWL API 4.5.26), it is supposedly possible to use arbitrary annotation properties in OBO “qualifier tags”, e.g. one can do this:

is_a: TEST:123 {PROP:456="lorem ipsum"}

to annotate the SubClassOf axiom with a http://purl.obolibrary.org/obo/PROP_456 property.

However, this does not always work. For example, in Uberon, the IAO:0000116 tag is most of the times translated correctly as http://purl.obolibrary.org/obo/IAO_0000116, but sometimes translated (incorrectly) as http://www.geneontology.org/formats/oboInOwl#0000116 — which among other consequences makes the ontology impossible to serialise in RDF/XML, because the RDF/XML writer tries to write the annotation as a <oboInOwl:0000116> XML tag (which it cannot do because 0000116 is not a valid XML tag name).

The root cause of the issue lies in the oboIdToIRI_load() method, which I believe does not behave as intended. When called with the oboInOwlDefault argument set to true, it completely ignores the existing prefix and forcibly translate the Curie into the http://www.geneontology.org/formats/oboInOwl# namespace (lines 1729–1731).

That is, oboIdToIRI_load("IAO:0000116", true) will return http://www.geneontology.org/formats/oboInOwl#0000116.

I don’t see how this can be the desired behaviour. The oboInOwlDefault argument should only have an effect in cases where the tag is not qualified (e.g., to translate {source="lorem ipsum"} into http://www.geneontology.org/formats/oboInOwl#source). When the tag includes an explicit prefix (e.g. {IAO:0000116="lorem ipsum"}), the prefix should always be taken into account regardless of the oboInOwlDefault value.

To make things worse, the issue is in many cases hidden by the use of the idToIRICache. When translating any OBO ID to an IRI, that cache is queried before any call to oboIdToIRI_load(). But importantly, the idiToIRICache does not grow indefinitely: when it reaches a certain size, older entries are removed from it.

I believe what happens in the aforementioned Uberon issue is something along the lines of:

  1. The OBO parser processes the editor_note type definition, which contains a cross-reference to IAO:0000116. The ID is translated correctly (the oboIdToIRI_load method is called with oboInOwlDefault set to false in this context) and put into the ID cache. This always happens first, because even though type definitions are physically located at the end of an OBO file, the Typedef frames are always translated before the Term frames.
  2. When processing a Term frame that contains a IAO:0000116="..." qualifier, the cache is queried for the IAO:0000116 ID. At the beginning, the cache will still contain that ID and will therefore return the correct IRI.
  3. But at some point, provided that the ontology is large enough, the IAO:0000116 ID will be dropped from the cache. Then the next time the parser will have to translate a IAO:0000116="..." qualifier, the cache will miss, and oboIdToIRI_load will be called to perform the translation with oboInOwlDefault set to true, thereby resulting in the ID being incorrectly translated into the http://www.geneontology.org/formats/oboInOwl# namespace.

This would explain why only some IAO:0000116 annotations are incorrectly translated. This would also explain why the unit test included in PR #1099 didn’t catch the issue: when the parser needs to translate this bit:

def: "Definition of term two." [] {MYONT:20="A nested annotation value."}

the MYONT:20 ID is still in the ID cache (associated with the correct IRI) because it has been correctly translated when processing the source type definition:

[Typedef]
id: source
name: source
xref: MYONT:20

and the test file is not large enough for the ID to have been removed from the cache.

@gouttegd thank you so much for doing this analysis, and sorry for apparently not adequately testing this change before the release! It looks like you're already implementing some fixes. Let me know if I can help. I will be away at a project meeting most of this week, so I probably can't do much over the next few days.

Given the fact that the ID cache hides the problem away in most cases, I don’t think any amount of testing on small files would have caught the issue. We only found it when using the new ROBOT on a “real world” file (Uberon).

I believe I indeed have a fix ready, but I want to test it intensively before we go through the whole cycle of releasing a new OWL API, a new ROBOT, a new ODK, and a new Protégé.

@gouttegd thanks for the excellent analysis and the pull request. Keeping this open to update the other branches.