INCATools/boomer

Using justification in posterior probability scoring

cmungall opened this issue · 1 comment

Currently, for P(A|H) we assume a uniform probability, except in the case where the ontology O is incoherent.

We want P(A|H) to be higher when the pre-existing axioms A are justified by the hypothetical axioms.
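For reference, the quantity being scored is the usual Bayes decomposition (as in the kboom paper); the proposal here only touches the likelihood term:

$$
P(H \mid A) \;\propto\; P(A \mid H)\,P(H)
$$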

Consider:

```
A:
  classes: cat, felis, mammal, mammalia
  cat SubClassOf mammal
  felis SubClassOf mammalia
H:
  Pr(cat=felis) = 0.5
  Pr(mammal=mammalia) = 0.5
```

(here we may be trying to align two terminologies, a formal and common one, but that is not strictly relevant for this example)

Under the existing boomer posterior probability calculation, as specified in the kboom paper, all four solutions have equal posterior probability.
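To make that concrete, here is a minimal sketch (plain Scala, not boomer's actual API): with a uniform P(A|H), the posterior is proportional to the prior, so all four on/off combinations of the two hypothetical axioms score 0.5 × 0.5 = 0.25.

```scala
// Sketch only: uniform P(A|H) means the posterior collapses to the prior.
object UniformPosterior extends App {
  val hypotheses = List(("cat=felis", 0.5), ("mammal=mammalia", 0.5))

  // Each hypothetical axiom is independently accepted (p) or rejected (1 - p).
  val solutions = for {
    acceptFirst  <- List(true, false)
    acceptSecond <- List(true, false)
  } yield List(acceptFirst, acceptSecond)

  for (sol <- solutions) {
    val prior = hypotheses.zip(sol).map { case ((_, p), accepted) =>
      if (accepted) p else 1.0 - p
    }.product
    println(s"$sol -> $prior") // prints 0.25 for every combination
  }
}
```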

Intuitively, we would like to "reward" the selection of { cat=felis, mammal=mammalia }, not just because of our prior knowledge or guesses based on labels, but because the two hierarchies mutually support one another. The fact that cat is a mammal justifies felis being a mammalia when the two equivalence axioms are assumed.

Conversely, consider:

```
A:
  classes: cat, felis, mammal, mammalia, octopus
  cat SubClassOf mammal
H:
  Pr(cat=octopus) = 0.5
  Pr(mammal=mammalia) = 0.5
```

Again, using the existing algorithm, all four combinations have equal posterior probability. However, here we want to weigh against the solution { cat=octopus, mammal=mammalia } -- not because of our prior knowledge, but because there was no assertion that octopus is a mammalia. If we accept cat=octopus, then together with mammal=mammalia this entails an entirely new fact that was never asserted.

I'm open to ideas on how to incorporate this. I think the latter case may be cheaper to compute. Just as we make a UNA by pre-populating implicit NotEquivalent axioms between classes in a single ontology, we can make a probabilistic OWA assumption: if an input sub-ontology does not entail an axiom (where the signature of the axiom is a subset of the sub-ontology signature), then we assign a low probability to that axiom. We might think of this intuitively as the alignment 'disrupting' an ontology by introducing new entailments.
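A sketch of how that penalty might look (Axiom and all names below are illustrative stand-ins, not boomer/whelk API; real entailment sets would come from a reasoner):

```scala
// Sketch of the probabilistic-OWA penalty.
case class Axiom(sub: String, sup: String) {
  def signature: Set[String] = Set(sub, sup)
}

// Each entailment under H that falls entirely within one input ontology's
// signature, but was not already entailed by that ontology alone, counts as
// a "disruption" and contributes a low probability factor.
def owaLikelihood(ontSignature: Set[String],
                  ontEntailments: Set[Axiom],
                  entailmentsUnderH: Set[Axiom],
                  lowProb: Double = 0.01): Double = {
  val disruptions = entailmentsUnderH.count { ax =>
    ax.signature.subsetOf(ontSignature) && !ontEntailments.contains(ax)
  }
  math.pow(lowProb, disruptions)
}
```

In the second example, assuming cat, mammal, and octopus come from the same input ontology, { cat=octopus, mammal=mammalia } newly entails octopus SubClassOf mammal within that signature, so that solution would pick up a lowProb factor.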

For the former case, this can be posed in terms of the concept of Justification in the DL literature. This may be quite expensive to compute in the general case. See also ontodev/robot#528

A more efficient but less complete solution would be to look for "justified squares":

```
d1 subClassOf[direct] c1
c1 = c2
d1 = d2
d2 subClassOf+ c2
```

entailed by H together with A, but not entailed by A alone. (Just calculate all justified squares from A in advance of running the tree search and subtract this set.)
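A sketch of the square-finding step, using the same toy representation as above (directSub and transSub are assumed precomputed from the asserted hierarchy; all names are illustrative):

```scala
// Enumerate justified squares over the currently selected equivalences.
def justifiedSquares(directSub: Set[(String, String)],
                     transSub: (String, String) => Boolean,
                     equiv: Set[(String, String)]): Set[(String, String, String, String)] = {
  // Classes equated to x by the selected (undirected) equivalences.
  def partners(x: String): Set[String] =
    equiv.collect {
      case (a, b) if a == x => b
      case (a, b) if b == x => a
    }
  for {
    (d1, c1) <- directSub // d1 subClassOf[direct] c1
    c2 <- partners(c1)    // c1 = c2
    d2 <- partners(d1)    // d1 = d2
    if transSub(d2, c2)   // d2 subClassOf+ c2
  } yield (d1, c1, d2, c2)
}
```

As the parenthetical above says, the same enumeration run against A alone gives the set to subtract before tree search.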

I can't currently think of a principled way to go from this metric to P(A|H). If we only treat the final posterior probability as a ranking rather than an absolute value, this matters less.
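One purely ad hoc option: a log-linear adjustment that rewards justified squares and penalizes disruptions, with free weights to be tuned. This preserves ranking but makes no claim to be a calibrated probability.

```scala
// Ad hoc log-linear adjustment; alpha and beta are free parameters.
def logLikelihoodAdjustment(numJustifiedSquares: Int, numDisruptions: Int,
                            alpha: Double = 1.0, beta: Double = 3.0): Double =
  alpha * numJustifiedSquares - beta * numDisruptions
```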

I think we can track new "within-namespace subsumptions" using a whelk plugin similar to the current within-namespace equivalence checker. We could prevent them entirely for a particular namespace, or count them. As you say, I'm not sure how to go from new within-namespace subsumptions to a probability adjustment.
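A sketch of the counting variant (toy representation again; namespaceOf stands in for however the class-to-namespace mapping is derived, and the subsumption sets would really come from whelk's inferred taxonomy):

```scala
// Count subsumptions that are new under H and stay within one namespace.
def newWithinNamespaceSubsumptions(baseSubs: Set[(String, String)],
                                   subsUnderH: Set[(String, String)],
                                   namespaceOf: String => String): Int =
  (subsUnderH -- baseSubs).count { case (sub, sup) =>
    namespaceOf(sub) == namespaceOf(sup)
  }
```

Preventing them entirely for a particular namespace would then just mean requiring this count to be zero.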