INCATools/relation-graph

algorithm to make RG faster

Closed this issue · 3 comments

I suspect RG spends a lot of time working through combinations that can be eliminated ahead of time.

E.g. consider a large ontology consisting of protein classes P and go terms T, where the two are linked by P subClassOf involvedIn some T, where T employs a rich RBox such that there are many interesting R some Ts, but there are no named R some Ps. We waste a lot of time if we materialize all R some Ps, we can filter ahead of time and determine there is no named classes inferred to be subclasses.

The following algorithm builds a reduced set of class expressions Xs to materialize:

Cs = bfsort(Ont.classes)
Rs = bfsort(Ont.properties
Excluded = {}
Xs = []
For r in Rs:
  For c in Cs:
    if exists r', c' such that r' in {parent(r),r} and c' in {parent(c),c} and Excludes[(r',c')]:
      Excluded[(r,c)] = True
      Break
    Else:
      Excluded[(r,c}] = |Reasoner.queryDescendants( r some c )| > 0
      Xs.append( r some c )

Intuitively this will give better results in many cases, but may come at a penalty of doing extra computation when Xs approximates the full Cs x Rs cross product. Benchmarks required!

As an aside, I still think the strategy of materializing R-some-Cs is a bit contorted, I suspect you will always get better performance by modeling the RG directly in the ABox and using simple property chains/rules e.g. R o SubClassOf -> R together with transitivity.. initial experiments with datalog suggest this is the case but not clear if performance is due to the datalog system (Souffle) or the differing strategies (TBox vs ABox)

actually my algorithm can be easily improved. There is no need to visit all elements of the cross-product. Just do a breadth of depth first search, but don't search any deeper if there are no computed subsumers. doh!

L = [ (topObjectProperty, owlThing) ]
E = [ <c sub reasoner.ancestors(c)>  for c in O.classes ]
while |L| > 0:
  r, c = pop L
  D = Reasoner.queryDescendants( r some c )
  if |D| > 0
    E += [<d r c> for d in D]
    L += { directChildren(p) x directChildren(c) }

I am not sure on the exact mechanism every reasoner users to calculate descendants, but there are some obvious pre-checks that could be done, e.g using range constraints, to avoid unneccessary calculation

note currently RG takes a very long time with the union of NCBITaxon+RO. In fact the whole procedure is unneccessary here due to zero existentials. My algorithm above will return zero results for descendants of "topProp some Thing" and calculate zero dynamic queries

Something like your algorithm is now implemented in #27. There is still a problem with the full NCBITaxon+RO, which is more due to the Whelk reasoning algorithm than with relation-graph. It has to do with a join order that's baked into the algorithm which turns out to be terrible when you query r some C and C has a vast number of subclasses (e.g. checking "topProp some Thing").

Further performance improvements in #64.