Question about self join

Question

Question about self join

StarGazerM opened this issue a year ago · 1 comments

In the compiled RAM of query contains self join for example:

path(from, to) :- edge(from, to).
path(from, to) :- path(from, mid), path(mid, to).

Recursive rule will be compiled into:

LOOP
   DEBUG "path(from,to) :- \n   path(from,mid),\n   path(mid,to).\nin file tc.dl [10:1-10:50]"
    QUERY
     IF ((NOT ISEMPTY(@delta_path)) AND (NOT ISEMPTY(path)))
      FOR t0 IN @delta_path
       FOR t1 IN path ON INDEX t1.0 = t0.1
        IF ((NOT (t0.0,t1.1) IN path) AND (NOT (t0.1,t1.1) IN @delta_path))
         INSERT (t0.0, t1.1) INTO @new_path
    END QUERY
   END DEBUG
   DEBUG "path(from,to) :- \n   path(from,mid),\n   path(mid,to).\nin file tc.dl [10:1-10:50]"
    QUERY
     IF ((NOT ISEMPTY(path)) AND (NOT ISEMPTY(@delta_path)))
      FOR t0 IN path
       FOR t1 IN @delta_path ON INDEX t1.0 = t0.1
        IF (NOT (t0.0,t1.1) IN path)
         INSERT (t0.0, t1.1) INTO @new_path
    END QUERY
   END DEBUG

The first RAM query is joining delta version of path with full version path and the second is reverse.
I have 2 question:

why there is a (NOT (t0.1,t1.1) IN @delta_path)) in first query
why the self join is not simple as join delta with delta and then delta with full? this should also contains all result, iterate over full seems unnecessary?

Answer 1 · 2024-02-13T17:27:13.000Z

The rules are evaluating semi-naive step i+1. The test NOT (t0.1,t1.1) IN @delta_path is NOT (mid,to) IN @delta_path, it is a way to simulate path at step i-1.

I reckon it is confusing to have to mix values from t0 and t1 but it's probably due to lowering and t1.0 = t0.1. It is equivalent to (NOT (t1.0, t1.1) IN @delta_path.

If you are wondering why we need path at step i-1 during evaluation of step i+1, it's an optimization of the semi-naive evaluation well described in the literature. R <- R1 ... Rn, T1 ... Tm is replaced with tempR(i+1) <- R1 ... Rn, T1(i) .. Tj-1(i), ΔTj(i), Tj+1(i-1) ... Tm(i-1) for j in 1..m.

This definition requires the relations T at steps i and i-1. Since we don't want to keep history of 2 steps in Soufflé, we only store T at step i and we say T(i-1) = T(i) - Δ(i).

There is a second optimization here, Soufflé does not compute tempR(i+1) instead it computes the ΔR(i+1) (that we call @new) by making sure we are not inserting facts that already exist in R(i): (NOT (from, to) IN path).

Eventually at the end of each LOOP iteration, Soufflé computes R(i+1) = R(i) + ΔR(i+1):

QUERY
    FOR t0 IN @new_anc
       INSERT (t0.0, t0.1) INTO anc
END QUERY