oracle-samples/clara-rules

Proposal for efficient updates of facts in the session

Closed this issue · 6 comments

Currently, when we want to change a fact we have to retract the original version and then insert the modified version. This works, but it is inefficient in cases where the modification will have limited impact on the session. In our case, one important problem is that we have a ReferenceTime fact that contains the "current time" and is used in numerous rules in joins with facts that have timestamps on them; we then ask questions like "is this data point within the last year?" When we retract a ReferenceTime and replace it with one that is, for example, a few hours later, as a practical matter we would not expect much actual impact on the session. Therefore, in principle, when we have a rule like

(defrule last-year-rule
    [ReferenceTime (= ?time time)]
    [?fact <- FactWithTime (time-based-predicate? ?time time)]
    =>
    (insert! (->FactThatCausesLotsOfRuleActivations ?fact)))

we should be able to avoid any work other than re-checking the join between the ReferenceTime fact and the FactWithTime facts when the truth of time-based-predicate? does not change due to the change in the time. However, in our current pattern we completely retract the ReferenceTime and insert a new one, which causes a significant amount of work in the Rete network that shouldn't be necessary.
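Concretely, the pattern we use today looks roughly like this (the ->ReferenceTime constructor and the time values are illustrative):

```clojure
;; Current approach: full retraction followed by re-insertion. Both the
;; retraction and the new insertion propagate through the whole network,
;; even when nothing downstream actually changes.
(-> session
    (retract (->ReferenceTime old-time))
    (insert (->ReferenceTime new-time))
    (fire-rules))
```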

I've considered several approaches for reducing this work.

A. Create a special atom or other deref'able type whose changes would be detected by Clara and be reflected in the rule network. This would be a similar idea to the specialized atoms used in Reagent. This would have the advantage of being fairly simple to implement but would fundamentally introduce a new logical structure to rules that doesn't belong and would only be present for performance reasons. The appropriate usage of such a structure would be a confusing part of the API.

B. Retract the original facts and then insert their replacements, but before propagating facts inserted due to activations by the replacements or sending retractions to the relevant alpha nodes, eliminate duplicates where the same fact was both inserted and retracted. I'm concerned that searching for such pairs would be too costly in large rule sessions, though.

C. Propagate a pair of [old-fact replacement-fact] through the Rete network and stop further propagation once we can determine that the replacement will not impact the network state further and all memory modifications necessary to ensure the accuracy of future truth maintenance operations are complete.

Overall, at this point I favor option C. It does not change the fundamental semantics of rules or rule design. It would require adding an update method to ISession, but this seems like an easy-to-understand operation if one already understands insertions and retractions.
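As a rough sketch of what the option C API might look like (the update function and its signature here are hypothetical, not part of the current API):

```clojure
;; Hypothetical option C usage: pass [old-fact replacement-fact] pairs and
;; let the engine stop propagation once the replacement provably has no
;; further impact on network state.
(-> session
    (update [[(->ReferenceTime old-time) (->ReferenceTime new-time)]])
    (fire-rules))
```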

I have created a very preliminary prototype of option C for a limited subset of nodes (AlphaNode, RootJoinNode, and QueryNode) at WilliamParker@1590249. Note that there are definitely things I'd want to clean up/fix in this before merging it, aside from the lack of support for many nodes; my objective is just to give a concrete idea of the sort of enhancement I'm talking about.

I see at least the following possible optimizations:

  • When the bindings used by a ProductionNode didn't change as the result of an update, don't do anything as a result of the update. My expectation is that this would be the most important optimization.
  • When a node in the beta network doesn't change a token's match status due to the update of an element, and the condition that created the beta node doesn't have a result binding, don't propagate the update downstream. This would require not adding elements to the tokens propagated downstream in the existing paths, so I didn't focus on this one for the moment, but I think it could be done.
  • Updates of elements in accumulator and negation conditions would likely not need to be propagated in some cases, but I haven't fully reasoned through these cases yet.
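To make the first optimization concrete, here is a hypothetical rule (the within-last-year? predicate and ->DerivedFact constructor are illustrative, not from the codebase). If ReferenceTime moves forward a few hours but ?fact still satisfies the predicate, then even though ?time itself changed, the only binding the right-hand side actually uses is ?fact, which is unchanged, so the RHS would not need to re-fire:

```clojure
(defrule unchanged-bindings-example
    [ReferenceTime (= ?time time)]
    [?fact <- FactWithTime (within-last-year? ?time time)]
    =>
    ;; RHS depends only on ?fact, not ?time.
    (insert! (->DerivedFact ?fact)))
```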

@mrrodriguez @rbrush

I have created a working draft of this at WilliamParker@af97744. I ended up taking a somewhat different approach on this than described above. In terms of the API, instead of passing particular pairs of facts, the user just passes lists of facts to insert and retract, with the expectation that they may cancel each other out. If the retractions are not present, the effect is just to insert the insertions. In essence, the functional semantics of the new call to the engine are the same as inserting the insertions, retracting the retractions, and then firing the rules. The key difference is performance.

In terms of implementation, the heart of it is that insertions and retractions are used to cancel each other out in the :pending-updates in the engine at each iteration of the fire-rules* loop. I replaced the current storage of pending updates with an UpdateCache type that uses the same implementation for the current user-facing functions (insert, retract, and fire-rules), just behind a new type; as a result I do not expect noticeable impact on these calls. However, the new call uses a different CancellingUpdateCache that cancels out such updates. Since all facts that are added to the pending-updates are ultimately inserted or retracted, depending on the type of the update, and all facts in the pending-updates are flushed before any activations can be fired, I don't see any scenarios where this approach would be functionally incorrect. This is also discussed some at #171 (comment) in the context of that issue.
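For illustration, here is a self-contained sketch of the cancellation idea (not the actual UpdateCache implementation; the function name and data shapes are my own). Pending insertions and retractions of equal facts pair off one-to-one, and whatever remains is what must actually be applied:

```clojure
(defn cancel-updates
  "Pair off facts appearing in both the pending insertions and the pending
  retractions; return the updates that remain after cancellation."
  [insertions retractions]
  (loop [[r & more] retractions
         ins        (vec insertions)
         kept       []]
    (if (nil? r)
      {:insertions ins :retractions kept}
      (let [idx (.indexOf ins r)]
        (if (neg? idx)
          ;; No matching insertion; the retraction must really happen.
          (recur more ins (conj kept r))
          ;; A matching insertion exists; the pair cancels out.
          (recur more
                 (into (subvec ins 0 idx) (subvec ins (inc idx)))
                 kept))))))

;; Example: :b cancels, :a and :c still insert, :d still retracts.
(cancel-updates [:a :b :c] [:b :d])
;; => {:insertions [:a :c], :retractions [:d]}
```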

I'm leaning against trying to directly propagate pairs of updates through the network at this point. There are some heuristics that could be used to improve efficiency there, but after kicking this around for some time I didn't find any general approach to dealing with the propagation of such pairs through accumulator and negation conditions. Cases where the join bindings differ are another complicating factor, since they remove much of the engine's ability to batch propagation through the network. Furthermore, the heuristics I looked at all ended up adding significant complexity to the engine, while the complexity added by this approach is actually pretty minimal. Of course, if someone else sees such a general approach I'd be interested to hear it. :) It is possible that we could add such heuristics in the future if they turn out to be needed, but right now I'm inclined to go with the simple approach, which still achieves the primary goal of preventing unbounded propagation through the Rete network when changes to facts don't impact rules. I also like that this approach doesn't require that the cancellation occur in the first batch of rules. If FactA causes FactB to be inserted and FactB causes FactC to be inserted, and FactA is modified in a way that modifies FactB but the modification of FactB does not impact FactC, then in most scenarios there should be no downstream updates from FactC.

It is possible that this cancelling cache could be used elsewhere as well, but I'd be wary of the extra work to look for cancelling insertions/retractions causing a performance hit in cases where we don't expect much cancellation of operations anyway. In particular, this cache relies on hashing facts, and we've had use cases before where hashing was a very expensive operation; hence the logic in the memory that relies on highly optimized linear sweeps, which turned out to be faster than hash-based lookups in real use cases. We could potentially make a cancelling update cache that used linear sweeps as well if we kept information on the production node ID that caused the update and used that as a key in a map to linear stores of updates per node. However, I think we can defer work on that until we have concrete use cases where facts that are expensive to hash would be used in this efficient update logic (our worst use cases with expensive-to-hash facts wouldn't benefit from this efficient update logic).

Regarding where to add this, my inclination is to add a user-facing function to call this method on LocalEngine in a "clara.rules.experimental" or some such namespace for the time being since this could end up evolving further with use on real data and rulesets in ways that would impact the API.

Thoughts welcome, both from Cerner colleagues and the community at large.

I haven't completely forgotten about this issue. I have started looking through what you have, and I think I have a few questions on how you see this being used and being useful. I'll think about it a little more and follow up with some more concrete thoughts.

I've created a PR that addresses @mrrodriguez 's comments at #263.

I'll want to look at this myself at least once more before merging, but I'd like to get some thoughts on this.

Particular items to discuss:

  • Will the reordering of things in the :pending-updates cause anything to break? I don't think so, since once something is put in the pending-updates it will always be executed eventually, and activations shouldn't be popped off again until all updates have been flushed, but it is something to think about.
  • Thoughts on whether adding an experimental namespace like this is a good pattern.
  • Note that I didn't just use frequencies maps in the cancelling cache, since that would have implications for facts with metadata: it would allow facts that are equal but have different metadata to cancel each other out. That is fundamentally the same as the current behavior w.r.t. metadata in the memory, but, say, duplicating a fact and thus its metadata would be new behavior to my mind. There are also advantages to maintaining fact object identities for quick identity-based token sweeps in the memory in later operations.
  • I don't anticipate any impacts on current functionality from this change, but if anyone sees potential for such impacts that would be something to discuss. The update cache is now a stateful object, but I replicated the current behavior of the default cache behind the wrapper. I'll plan to do some testing on this before a merge.
  • Currently this doesn't offer any user selection of the cache to use. That could potentially be done in the future if particular caches turned out to be useful for particular workloads but my inclination is to not go down that road until we have use cases for it.
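To illustrate the metadata point above (the fact maps and metadata keys here are made up): Clojure equality ignores metadata, so a frequencies map would collapse facts that differ only in metadata onto a single key.

```clojure
;; Two facts that are = but carry different metadata.
(def fact-a (with-meta {:temperature 100} {:source :sensor-1}))
(def fact-b (with-meta {:temperature 100} {:source :sensor-2}))

(= fact-a fact-b)                 ;; => true  (equality ignores metadata)
(= (meta fact-a) (meta fact-b))   ;; => false (the metadata differs)

;; A frequencies map sees only one distinct fact, silently keeping the
;; metadata of whichever instance it encountered first.
(frequencies [fact-a fact-b])     ;; => {{:temperature 100} 2}
```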

In a comment on the draft at WilliamParker@af97744#commitcomment-20782287, @mrrodriguez suggested delaying the actual execution of the insert and retract operations until fire-rules is called. Going down the path used here could then be an option supplied to fire-rules. Making the API more general in this way would also facilitate any future options we might find it necessary to add, and has the added benefit of reducing the number of transitions between transient and persistent memory. I'm leaning toward proceeding with that approach now. I wouldn't anticipate this impacting the fundamental design of the insertion and retraction cancelling here, though; we'd basically go down the same code path we have in replace-facts when the option was provided to fire-rules.

In pseudocode, I'm envisioning something like

(deftype LocalSession [insertions-cache retractions-cache ...]
    (insert [facts]
        (LocalSession. (into insertions-cache facts) retractions-cache ...))
    (retract [facts]
        (LocalSession. insertions-cache (into retractions-cache facts) ...))
    (fire-rules []
        do-retractions-here
        do-insertions-here
        fire-rules-here
        (LocalSession. [] [] ...)))

My first thought would be to continue retracting everything up front and then inserting in the non-cancellation path, rather than placing retractions on the :pending-updates, in order to keep the performance characteristics of the existing code unchanged; but there are other ways that things could potentially be reordered that could be discussed and considered.

I have created a new PR for this on top of the changes in #269 at #271. Fundamentally the changes are largely the same as for #263 but with the API changed to be an option given to fire-rules rather than a new top-level function.
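For reference, the user-facing shape with the option on fire-rules would be something like the following (I'm assuming a :cancelling option keyword here; the exact name is whatever the PR settles on):

```clojure
;; Pending retractions and insertions are queued on the session and only
;; executed when fire-rules runs, with cancellation opted into via an option.
(-> session
    (retract old-reference-time)
    (insert new-reference-time)
    (fire-rules {:cancelling true}))
```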

It may need some more tests; that merits a bit more thought, and I'll want to take another pass through this before merging in any case. I don't foresee any major changes though, and I think this is ready for review by others.

I'm not sure why Travis isn't running for the PR.

This has been implemented in #271. I am going to close the issue. Any further work on this can be handled under new issues without the complication of the initial design discussions here.