JuliaSymbolics/Metatheory.jl

Exploding egraphs - Troubles defining rule sets working with eq-sat

gkronber opened this issue · 16 comments

I struggle to define a set of rules that does not lead to exponentially growing egraphs which never saturate for simple expressions (Metatheory v2.0.2).

In the following code I try to fold constants. The rule set is designed to trigger the issue I observe:

    t = @theory a b c begin
        (a + b) + c == a + (b + c)
        a - b --> a + -1.0 * b  # this rule triggers the problem (together with the first rule)
        a::Number + b::Number => a + b
        a::Number * b::Number => a * b
    end

   g = EGraph(:(1 + 1 - 1))
   param = SaturationParams(timeout=20)
   report = saturate!(g, t, param); 
   println(report)

In Iteration two a loop for an e-class is created which is later expanded in each iteration.

┌ Debug: ================ EQSAT ITERATION 4  ================
└ @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/saturation.jl:290
┌ Debug: SEARCHING
└ @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/saturation.jl:115
┌ Debug: Rule 1: (~a + ~b) + ~c == ~a + (~b + ~c) produced 10 matches
└ @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/saturation.jl:129
┌ Debug: Rule 2: ~a - ~b --> ~a + -1.0 * ~b produced 1 matches
└ @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/saturation.jl:129
┌ Debug: Rule 3: ~(a::Number) + ~(b::Number) => a + b produced 4 matches
└ @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/saturation.jl:129
┌ Debug: Rule 4: ~(a::Number) * ~(b::Number) => a * b produced 1 matches
└ @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/saturation.jl:129
┌ Debug: APPLYING 16 matches
└ @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/saturation.jl:210
┌ Debug: 1
└ @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/saturation.jl:267
┌ Debug: ================ EQSAT ITERATION 5  ================
└ @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/saturation.jl:290
┌ Debug: SEARCHING
└ @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/saturation.jl:115
┌ Debug: Rule 1: (~a + ~b) + ~c == ~a + (~b + ~c) produced 98 matches
└ @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/saturation.jl:129
┌ Debug: Rule 2: ~a - ~b --> ~a + -1.0 * ~b produced 1 matches
└ @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/saturation.jl:129
┌ Debug: Rule 3: ~(a::Number) + ~(b::Number) => a + b produced 12 matches
└ @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/saturation.jl:129
┌ Debug: Rule 4: ~(a::Number) * ~(b::Number) => a * b produced 1 matches
└ @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/saturation.jl:129
┌ Debug: APPLYING 112 matches
└ @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/saturation.jl:210
┌ Debug: 1
└ @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/saturation.jl:267

After two iterations:
image

After three iterations:
image

The default scheduler is the BackoffScheduler and it starts to ban some rules after a few iterations. However, I feel I should be able to solve this without relying on the scheduler.

The problem does not occur for the expression "1+1+1" . What do I miss?

Can you please try on this branch? #185

The debugging info should also be a lot better.

Yes, there is indeed more debugging info (output for timeout=20).
metatheory-debug-log-subtraction-simplification.txt

It seems the problem remains (if it is indeed a problem and not a misunderstanding on my side), as the graph is expanded to over 3000 nodes and saturation stops because of timeout.

If it is helpful, I could try to run this with egg on rust. I refrained to do so for now, because I would first need to setup everything on the rust side.

SaturationReport


=================
        Stop Reason: timeout
        Iterations: 20
        EGraph Size: 150 eclasses, 3203 nodes
 ────────────────────────────────────────────────────────────────────
                            Time                    Allocations
                   ───────────────────────   ────────────────────────
 Tot / % measured:      308ms /  30.8%           41.6MiB /  65.1%

 Section   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────
 Search        20   66.3ms   70.0%  3.31ms   21.1MiB   77.8%  1.05MiB
   1           20   35.8ms   37.8%  1.79ms   17.2MiB   63.5%   881KiB
   3           20   10.5ms   11.0%   523μs   1.63MiB    6.0%  83.7KiB
   2           20   10.2ms   10.8%   511μs   1.02MiB    3.8%  52.2KiB
   4           20   8.02ms    8.5%   401μs    945KiB    3.4%  47.2KiB
 Apply         20   18.4ms   19.4%   920μs   3.94MiB   14.6%   202KiB
 Rebuild       20   10.0ms   10.6%   500μs   2.06MiB    7.6%   106KiB
 ────────────────────────────────────────────────────────────────────
    Updating git-repo `/home/gkronber/.julia/dev/Metatheory`
   Resolving package versions...
    Updating `~/.julia/environments/v1.10/Project.toml`
  [e9d8d322] ~ Metatheory v2.0.2 `~/.julia/dev/Metatheory` ⇒ v2.0.2 `../../dev/Metatheory#ale/3.0`
    Updating `~/.julia/environments/v1.10/Manifest.toml`
  [e9d8d322] ~ Metatheory v2.0.2 `~/.julia/dev/Metatheory` ⇒ v2.0.2 `../../dev/Metatheory#ale/3.0`
Precompiling project...
  1 dependency successfully precompiled in 3 seconds. 275 already precompiled. 1 skipped during auto due to previous errors.

I tried to run the example with egg in Rust by adapting https://github.com/JuliaSymbolics/Metatheory.jl/tree/master/scratch.

I was not successful so far as egg does not seem to have the concept of a DynamicRule. Instead they use an analysis for constant folding (https://github.com/egraphs-good/egg/blob/main/tests/math.rs)

I will try to reproduce as soon as I can (im working RN). Is it super urgent?

Not super urgent. If you have an idea that I could investigate myself or a hint which parts of the code I should look at that would be helpful.

Not super urgent. If you have an idea that I could investigate myself or a hint which parts of the code I should look at that would be helpful.

https://web.archive.org/web/20221226130645/https://github.com/egraphs-good/egg/discussions/60

This is an interesting discussion. It may be possible that the system is not terminating but I would have to investigate.
https://effect.systems/doc/egraphs-2023-theory/paper.pdf

  1. what happens when you extract?
  2. what happens when you pass saturate!(g, SaturationParams(scheduler=SimpleScheduler)) ? It should avoid the exponential backoff heuristic, it'd be interesting to see if the e-graph also explodes.

Does it terminate with the simple scheduler with a high iteration number and e-class limit?

Associativity can be really nasty. It's usually associative-commutative-distributive rules that make e-graphs explode. I would like to explore theoretically how one can design a special kind of e-graph, with a special kind of e-node that is "associative", e.g. Node(:+, 1, 2, 3) is implicitly equal to all the possible ways of associating 1,2,3 without the need of an e-class. It's not an easy task

Thanks for pulling up this egg discussion. In fact, the very first "Nonterminating Example" with "associativity combined with annihilation" is more or less the same as I have posted here.

Not super urgent. If you have an idea that I could investigate myself or a hint which parts of the code I should look at that would be helpful.

It's mostly the entire EGraphs directory... Could be an issue with e-matching (unlikely), could be an issue with substitutions (term creation), or could be an issue with saturation itself. Could be the nature of the rewrite system that is non-terminating in equality saturation.

What I suggest for more hardcore debugging is to draw an e-graph on a piece of paper and to the saturation step by step by hand, comparing it with one run with the simple scheduler, starting at super low iteration limit. Add some prints in saturate! and eqsat_step! (mostly apply phase), to show every term that is inserted in the e-graph.

Then you can see if it's inserting all the terms that are expected, or if it's inserting something extra.

Please feel free to open a PR to the 3.0 branch if you find anything wrong, or in case you would just like to help out. Thanks a lot for the interest!

I have continued to work on this, and I would say the issue is solved for me. So feel free to close this issue if you like.

The following might be helpful for other users like me.

  • Correction: I stated above that egg does not have dynamic rules. This is obviously wrong, after reading the egg paper. In fact, dynamic rules in combination with semantic analysis are an important feature of egg.
  • The main reason for the performance problems, was my incorrect implementation and understanding of analysis values. The e-graph can easily grow to thousands of nodes, and the BackoffScheduler helps to mitigate rules that grow the e-graph too quickly. This is still fast. The creation and handling of analysis values can become slow, because it is your duty to implement make / join / modify / and equality correctly and efficiently (see below).
  • Do not create an analysis value for each e-node (in make). Only create an analysis value if a fact about the e-node / e-graph can be asserted. Otherwise return nothing to save memory allocations.
  • Do not use a mutable composite type for the analysis values. Analysis values must be immutable. My type was an immutable struct which however contained an Expr type (which is itself mutable). This implicitly means that equality of analysis values is based on comparison of references instead of values which can be surprising. If your analysis type contains a mutable type, make sure you implement Base.== for your analysis type to implement value-type comparison semantics.
  • Similarly if your analysis type contains Float values, implement Base.== to make sure that NaN == NaN. For example, in constant folding you might produce NaN values, and by default NaN != NaN. The saturation algorithm triggers a join operation for all parent nodes when it detects a changed analysis value (old != new). If the equality predictate (==) does not work correctly for your analysis types, this can lead to severe performance degradation if your graph is large, and potentially infinite loops if it has loops. It also triggers many calls to make which leads to many allocations.
  • Make sure that the join operation is associative and handles nothing values. The egg paper is jargon heavy. Not every programmer knowns what a "join semi-lattice" is (https://en.wikipedia.org/wiki/Semilattice). However, to implement semantic analysis with equality saturation correctly, it is important to understand exactly what this means, as the algorithm relies on the join semilattice semantics. By merging more enodes into an eclass over the iterations, you might be able to establish more analysis facts about the eclass. However, the order in which the enodes are added to the eclass must not matter. Depending on the concrete semantics, joining might mean that you disallow joining different values (e.g. in constant folding, it is an error if the two analysis values have different values, in the even/odd example in the tutorials, an eclass only be even or odd). But if you e.g. track the size of the smallest expression for the eclass then it might be possible that over time you find smaller values. When joining the size values, you are allowed to return the minimum of the two values.
  • The code for equality saturation in Metatheory has two assertions in the rebuild! function which are commented. Activate these two assertions to find problems with your implementation of join / make
  • Annotate your implementations of make / join / modify with TimerOutputs. Too many calls or allocations in these functions are an indicator of a problem with the implementation of your analysis type.

I have also fixed a few obvious mistakes in the ale/3.0 branch in my fork (branch 3.0-fixes) https://github.com/gkronber/Metatheory.jl/tree/3.0_fixes . Feel free to use the fixes or I can create a PR if you want. Let me know if I should split up the changes into multiple PR.

us mistakes in the ale/3.0 branch

A PR would be really appreciated!! Also, I found the comments above quite useful, so feel free to include them in the docs (maybe a new .md file with "troubleshooting tips-and-tricks"). Thanks a lot! And I'm glad the issue was fixed.

I'm curious about what you're working on, please send me a mail at alessandro -at- cheli -dot- dev

I've opened a PR #191

Concerning the docs I suggest to create a new example with a new analysis example. I'll do this in a separate branch.

Concerning the docs I suggest to create a new example with a new analysis example. I'll do this in a separate branch.

Thanks, that would be really nice.

The test files in test/tutorials are rendered and included in the docs via Literate.jl

So if you add a test file with an example there, and write some markdown in the docs it'll be instantly included in the docs, in the style of https://juliasymbolics.github.io/Metatheory.jl/dev/tutorials/while_interpreter/

Closing the issue. Thanks a lot for your contribution