JuliaSymbolics/Metatheory.jl

Equality saturation hash collision leads to unwanted merge of unequal eclasses

Closed this issue · 9 comments

I'm using the latest commits from the ale/3.0 branch.

I use equality saturation to simplify the set of all arithmetic expressions (of limited length) that can be generated from a grammar.
I have been struggling with seemingly random errors where eclasses are merged even though they are not equal (as detected in the join method of my semantic analysis and constant folding).

After detailed debugging, I think I found two VecExprs with different elements but the same hash value (note the different arguments in the last two places). I believe this is the reason for my problems.

julia> expr1 = UInt64[0xffd1740b5a40f64c, 0x0000000000000011, 0x4079ccad1b3303eb, 0x6b268fa6aca9af40, 0x000000000000006e, 0x00000000000000dc];
julia> expr2 = UInt64[0xffd1740b5a40f64c, 0x0000000000000011, 0x4079ccad1b3303eb, 0x6b268fa6aca9af40, 0x0000000000000037, 0x0000000000000037];
julia> Metatheory.v_hash!(expr1) == Metatheory.v_hash!(expr2)
true
julia> hash(expr1) == hash(expr2)
true

v_hash! calculates the hash value over the elements 2:end using Base.hash, which in this case leads to a collision. The effect is that in

function add!(g::EGraph{ExpressionType,Analysis}, n::VecExpr, should_copy::Bool)::Id where {ExpressionType,Analysis}
  canonicalize!(g, n)
  h = IdKey(v_hash(n))

  haskey(g.memo, h) && return g.memo[h]
...

we may immediately return an existing eclass that has the same v_hash as the new expression but is not equivalent.

To solve this, it may be necessary to check whether the two VecExprs match completely after finding that their hash values match. But I'm not sure how central the comparison of v_hash values is to the whole equality saturation implementation.
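
For illustration, here is a self-contained sketch of that idea (the names lookup_or_add! and the bucket layout are hypothetical, not Metatheory's actual code): entries are bucketed by hash, and a hash hit is verified by element-wise comparison of the payload.

# Sketch: memo keyed by hash, with full payload comparison on lookup to
# guard against collisions.
const SketchId = UInt64

memo = Dict{UInt64,Vector{Tuple{Vector{UInt64},SketchId}}}()

function lookup_or_add!(memo, n::Vector{UInt64}, next_id::SketchId)
  h = hash(@view n[2:end])          # same hash that v_hash! would cache
  bucket = get!(memo, h) do
    Tuple{Vector{UInt64},SketchId}[]
  end
  for (expr, id) in bucket
    # full comparison of the payload (elements 2:end) resolves collisions
    if @views expr[2:end] == n[2:end]
      return id
    end
  end
  push!(bucket, (copy(n), next_id))
  return next_id
end

Applied to expr1 and expr2 from above, both would land in the same bucket but still receive distinct ids.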

It is difficult to produce a self-contained minimal example for this problem, as it only occurs when the signatures and eclass ids are such that the hash values in VecExpr collide. This is unlikely for a single expression but becomes practically certain when simplifying millions of expressions.
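
As a rough back-of-the-envelope estimate: by the birthday bound, the probability of at least one collision among n uniformly distributed 64-bit hashes is about 1 - exp(-n^2 / 2^65).

# Birthday-bound approximation for 64-bit hashes (rough estimate only)
p(n) = 1 - exp(-n^2 / 2.0^65)

p(1e6)   # ≈ 2.7e-8: rare within a single moderately sized e-graph
p(1e9)   # ≈ 0.027: noticeable once billions of e-nodes are hashed across runs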

After a more detailed code review and a comparison to egg, it looks like the issue can be solved by changing EGraph.memo to Dict{VecExpr,Id} instead of Dict{IdKey,Id}. The required changes are minimal; I'll prepare a branch and test whether this solves my issue. I'm not sure about the performance implications (I saw the note about needing a faster dictionary implementation).

@gkronber hey! thanks! yep that's a hash collision. 🤯
v_hash is computed by hashing n[2:end]:

julia> hash([0x4079ccad1b3303eb, 0x6b268fa6aca9af40, 0x000000000000006e, 0x00000000000000dc])
0x9fcfea6d1a89f415

julia> hash([0x4079ccad1b3303eb, 0x6b268fa6aca9af40, 0x0000000000000037, 0x0000000000000037])
0x9fcfea6d1a89f415

There's a reason why EGraph.memo is a Dict{IdKey, Id}: computing hashes millions of times per second is really expensive and took up most of the flamegraph when profiling. The IdKey is a mechanism to cache the hash.

If you change it to Dict{VecExpr, Id} it won't really work, because Dict will call Base.hash on the VecExpr instead of v_hash. These special functions are defined in order to cache the hash in the first element of the VecExpr vector: once you compute it, there's no need to recompute it. Each modifying v_... operation on a VecExpr invalidates the hash and resets it to 0. The hash is recomputed only if its cached value is 0.

Check how v_hash and its ! version are defined:

"""
Compute the hash of a `VecExpr` and store it as the first element.
"""
@inline function v_hash!(n::VecExpr)::Id
  if iszero(n[1])
    # compute and cache: slot 1 stores the hash of the payload
    n[1] = hash(@view n[2:end])
  else
    # cache already set; the disabled check below would verify it
    # h = hash(@view n[2:end])
    # @assert h == n[1]
    n[1]
  end
end

"""The hash of the e-node."""
@inline v_hash(n::VecExpr)::Id = @inbounds n[1]

I suggest checking the Julia Base stdlib definition of Dict to see how it handles hash collisions. I remember seeing issues about collisions not being handled, but it seems they are now!

julia> v1 = [0x4079ccad1b3303eb, 0x6b268fa6aca9af40, 0x000000000000006e, 0x00000000000000dc];

julia> v2 = [0x4079ccad1b3303eb, 0x6b268fa6aca9af40, 0x0000000000000037, 0x0000000000000037];

julia> d = Dict{Vector{UInt64}, Int}()
Dict{Vector{UInt64}, Int64}()

julia> d[v1] = 1
1

julia> d[v2] = 2
2

julia> d
Dict{Vector{UInt64}, Int64} with 2 entries:
  [0x4079ccad1b3303eb, 0x6b268fa6aca9af40, 0x000000000000006e, 0x000000000 => 1
  [0x4079ccad1b3303eb, 0x6b268fa6aca9af40, 0x0000000000000037, 0x000000000 => 2

julia> d[v1]
1

julia> d[v2]
2

julia> hash(v1)
0x9fcfea6d1a89f415

julia> hash(v2)
0x9fcfea6d1a89f415

I defined Base.hash(e::VecExpr) = v_hash(e).

All Metatheory unit tests still pass. I'm currently testing it with my code; this will take a while.

Base.hash(e::VecExpr) = v_hash(e). Sadly, that's type piracy, since const VecExpr = Vector{UInt64}: it redefines Base.hash for a Base type the package doesn't own.
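
Because VecExpr is just an alias, that method hijacks one-argument hashing for every Vector{UInt64} in the process, not only for e-nodes, e.g.:

# With Base.hash(e::VecExpr) = v_hash(e) defined, ANY Vector{UInt64} now
# "hashes" to its first element:
hash(UInt64[0x0, 0xdead])   # returns 0x0 -- unrelated Dict keys degrade badly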

I created a new composite type for VecExpr that just wraps Vector{Id}.
Changes are in this branch: https://github.com/gkronber/Metatheory.jl/tree/225_hash_collision
The unit tests pass, and the collision no longer occurs when running my code.
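
The core of the wrapper approach looks roughly like this (a condensed sketch with hypothetical names WrappedVecExpr and cached_hash!; see the branch for the real code):

# Wrapping the vector lets hash/== be overloaded without pirating
# Vector{UInt64}.
struct WrappedVecExpr
  data::Vector{UInt64}   # data[1] caches the hash; data[2:end] is the e-node
end

function cached_hash!(n::WrappedVecExpr)::UInt64
  iszero(n.data[1]) && (n.data[1] = hash(@view n.data[2:end]))
  n.data[1]
end

Base.hash(n::WrappedVecExpr, h::UInt) = hash(cached_hash!(n), h)

# Element-wise comparison of the payload is what lets Dict tell colliding
# keys apart; equal payloads also yield equal cached hashes, so hash and ==
# stay consistent.
Base.:(==)(a::WrappedVecExpr, b::WrappedVecExpr) =
  @views a.data[2:end] == b.data[2:end]

With this in place, the Dict demo above behaves the same for wrapped e-nodes: the two colliding expressions hash to the same slot but compare unequal, so they keep separate entries.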

The benchmarking script shows no difference.
New code (0fd0c91):

julia> run(SUITE)
5-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "prop_logic" => 4-element BenchmarkTools.BenchmarkGroup:
          tags: ["egraph", "logic"]
          "freges_theorem" => Trial(1.183 ms)
          "prove1" => Trial(41.253 ms)
          "demorgan" => Trial(111.700 μs)
          "rewrite" => Trial(118.100 μs)
  "basic_maths" => 2-element BenchmarkTools.BenchmarkGroup:
          tags: ["egraphs"]
          "simpl2" => Trial(26.659 ms)
          "simpl1" => Trial(6.358 ms)
  "while_superinterpreter" => 1-element BenchmarkTools.BenchmarkGroup:
          tags: ["egraph"]
          "while_10" => Trial(22.277 ms)
  "egraph" => 2-element BenchmarkTools.BenchmarkGroup:
          tags: ["egraphs"]
          "addexpr" => Trial(5.190 ms)
          "constructor" => Trial(592.614 ns)
  "calc_logic" => 2-element BenchmarkTools.BenchmarkGroup:
          tags: ["egraph", "logic"]
          "freges_theorem" => Trial(47.019 ms)
          "demorgan" => Trial(86.500 μs)

For the current code (7709c90):

julia> run(SUITE)
5-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "prop_logic" => 4-element BenchmarkTools.BenchmarkGroup:
          tags: ["egraph", "logic"]
          "freges_theorem" => Trial(1.166 ms)
          "prove1" => Trial(41.580 ms)
          "demorgan" => Trial(107.500 μs)
          "rewrite" => Trial(112.500 μs)
  "basic_maths" => 2-element BenchmarkTools.BenchmarkGroup:
          tags: ["egraphs"]
          "simpl2" => Trial(25.059 ms)
          "simpl1" => Trial(6.043 ms)
  "while_superinterpreter" => 1-element BenchmarkTools.BenchmarkGroup:
          tags: ["egraph"]
          "while_10" => Trial(21.372 ms)
  "egraph" => 2-element BenchmarkTools.BenchmarkGroup:
          tags: ["egraphs"]
          "addexpr" => Trial(4.827 ms)
          "constructor" => Trial(620.994 ns)
  "calc_logic" => 2-element BenchmarkTools.BenchmarkGroup:
          tags: ["egraph", "logic"]
          "freges_theorem" => Trial(44.582 ms)
          "demorgan" => Trial(85.800 μs)

Please check the changes. If you're OK with them, I'll create a PR (with all changes squashed into a single commit). Let me know if there are other benchmarks I should run.

Thanks! Yeah please go ahead! That's an amazing fix.

Glad to see it didn't slow things down.

@gkronber can you please open a PR with the changes?