Equality saturation hash collision leads to unwanted merge of unequal eclasses
I'm using the latest commits from the ale/3.0 branch.
I use eq-sat to simplify the set of all arithmetic expressions that can be generated from a grammar (with limited length).
I have been struggling with seemingly random errors where eclasses are merged even though they are not equal (as detected in the `join` method for semantic analysis and constant folding).
After detailed debugging, I think I found two `VecExpr`s with different elements but the same hash value (note the different arguments in the last two places). I believe this could be the reason for my problems.
```julia
julia> expr1 = UInt64[0xffd1740b5a40f64c, 0x0000000000000011, 0x4079ccad1b3303eb, 0x6b268fa6aca9af40, 0x000000000000006e, 0x00000000000000dc];

julia> expr2 = UInt64[0xffd1740b5a40f64c, 0x0000000000000011, 0x4079ccad1b3303eb, 0x6b268fa6aca9af40, 0x0000000000000037, 0x0000000000000037];

julia> Metatheory.v_hash!(expr1) == Metatheory.v_hash!(expr2)
true

julia> hash(expr1) == hash(expr2)
true
```
`v_hash!` calculates the hash value for the elements `2:end` using `Base.hash`, which in this case leads to a collision. The effect is that in
```julia
function add!(g::EGraph{ExpressionType,Analysis}, n::VecExpr, should_copy::Bool)::Id where {ExpressionType,Analysis}
  canonicalize!(g, n)
  h = IdKey(v_hash(n))
  haskey(g.memo, h) && return g.memo[h]
  ...
```
we may immediately return an existing eclass that has the same `v_hash` as the new expression but is not equivalent.
To solve this, it may be necessary to check whether the two `VecExpr`s match completely after finding that the hash values match. But I'm not sure how central the comparison of `v_hash` values is to the whole equality saturation implementation.
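For illustration only, here is a rough sketch of the kind of guard I mean. It assumes a hypothetical side table `memo_nodes` from hash key to the stored e-node, which Metatheory does not actually keep; the point is just that a hash hit has to be confirmed by an element-wise comparison:

```julia
# Hypothetical sketch, not Metatheory's actual API: `g.memo_nodes` is an
# assumed Dict{IdKey,VecExpr} remembering the e-node stored under each hash.
function lookup_checked(g, n::VecExpr)
  h = IdKey(v_hash(n))
  haskey(g.memo, h) || return nothing
  stored = g.memo_nodes[h]
  # Slot 1 is the cached hash; the e-node itself lives in slots 2:end.
  @view(stored[2:end]) == @view(n[2:end]) ? g.memo[h] : nothing
end
```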
It is difficult to produce a self-contained minimal example for this problem, as it only occurs when the signatures and eclass ids are such that the hash values collide in `VecExpr`. This is unlikely for a single expression but practically certain to occur when simplifying millions of expressions.
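As a rough back-of-the-envelope check (my own numbers, nothing measured): by the birthday bound, a 64-bit hash over `n` distinct e-nodes collides with probability about `1 - exp(-n^2 / 2^65)`, so collisions go from negligible to expected as the number of e-nodes grows:

```julia
# Birthday-bound estimate for a 64-bit hash; illustrative only.
collision_prob(n) = 1 - exp(-big(n)^2 / big(2)^65)

collision_prob(10^6)  # ≈ 2.7e-8: negligible for a single expression
collision_prob(10^9)  # ≈ 0.027: no longer negligible at scale
```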
After a more detailed code review and comparison to egg, it looks like the issue can be solved by changing `EGraph.memo` to `Dict{VecExpr,Id}` instead of `Dict{IdKey,Id}`. The required changes are minimal; I'll prepare a branch and test whether this solves my issue. I'm not sure about the performance implications (I saw the note about needing a faster dictionary implementation).
@gkronber hey! thanks! yep that's a hash collision. 🤯
`v_hash` is computed by hashing `n[2:end]`:
```julia
julia> hash([0x4079ccad1b3303eb, 0x6b268fa6aca9af40, 0x000000000000006e, 0x00000000000000dc])
0x9fcfea6d1a89f415

julia> hash([0x4079ccad1b3303eb, 0x6b268fa6aca9af40, 0x0000000000000037, 0x0000000000000037])
0x9fcfea6d1a89f415
```
There's a reason why `EGraph.memo` is a `Dict{IdKey, Id}`: computing hashes millions of times per second is really expensive and took up most of the flamegraph when profiling. That's a caching mechanism.

If you change it to `Dict{VecExpr, Id}` it won't really work, because it will call `Base.hash` on the `VecExpr` instead of `v_hash`. These special functions are defined in order to cache the hash in the first element of the `VecExpr` vector: once you compute it, there's no need to re-compute it. Each modifying `v_...` operation on a `VecExpr` invalidates the hash and resets it to 0. The hash is only recomputed if its cached value is 0.
Check how `v_hash` and its `!` version are defined:
"""
Compute the hash of a `VecExpr` and store it as the first element.
"""
@inline function v_hash!(n::VecExpr)::Id
if iszero(n[1])
n[1] = hash(@view n[2:end])
else
# h = hash(@view n[2:end])
# @assert h == n[1]
n[1]
end
end
"""The hash of the e-node."""
@inline v_hash(n::VecExpr)::Id = @inbounds n[1]
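To make the invalidation side of this concrete, here's a minimal sketch of the pattern a modifying `v_...` operation follows; `v_set_arg!` is an illustrative name, not necessarily the real API:

```julia
# Hypothetical sketch of the invalidation pattern. Recall that
# const Id = UInt64 and const VecExpr = Vector{Id}.
@inline function v_set_arg!(n::VecExpr, i::Int, id::Id)
  n[1] = 0       # zero the cached hash so the next v_hash! recomputes it
  n[i + 1] = id  # write into the payload, which lives in slots 2:end
end
```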
I suggest checking the Julia Base stdlib definition of `Dict` to see how they're handling hash collisions. I remember seeing issues about them not being handled, but it seems they are now!
```julia
julia> v1 = [0x4079ccad1b3303eb, 0x6b268fa6aca9af40, 0x000000000000006e, 0x00000000000000dc];

julia> v2 = [0x4079ccad1b3303eb, 0x6b268fa6aca9af40, 0x0000000000000037, 0x0000000000000037];

julia> d = Dict{Vector{UInt64}, Int}()
Dict{Vector{UInt64}, Int64}()

julia> d[v1] = 1
1

julia> d[v2] = 2
2

julia> d
Dict{Vector{UInt64}, Int64} with 2 entries:
  [0x4079ccad1b3303eb, 0x6b268fa6aca9af40, 0x000000000000006e, 0x000000000… => 1
  [0x4079ccad1b3303eb, 0x6b268fa6aca9af40, 0x0000000000000037, 0x000000000… => 2

julia> d[v1]
1

julia> d[v2]
2

julia> hash(v1)
0x9fcfea6d1a89f415

julia> hash(v2)
0x9fcfea6d1a89f415
```
I defined `Base.hash(e::VecExpr) = v_hash(e)`. All Metatheory unit tests still pass. I'm currently testing it with my code; this will take a while.
`Base.hash(e::VecExpr) = v_hash(e)`. Sadly it's type piracy, as `const VecExpr = Vector{UInt64}`.
I created a new composite type for `VecExpr` that just wraps `Vector{Id}`.
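A minimal sketch of what such a wrapper can look like; the field name and method details here are my assumptions, the real definition is in the branch below:

```julia
const Id = UInt64

# Hypothetical wrapper; see the linked branch for the actual code.
struct VecExpr
  data::Vector{Id}  # data[1] caches the hash, data[2:end] hold the e-node
end

v_hash(n::VecExpr) = @inbounds n.data[1]

# No piracy: the type and its methods now live in the same package.
# Dict buckets keys by `hash` and resolves collisions with `isequal`/`==`,
# so colliding keys stay distinct. (Assumes the cached hash is current.)
Base.hash(n::VecExpr, h::UInt) = hash(v_hash(n), h)
Base.:(==)(a::VecExpr, b::VecExpr) = a.data == b.data
```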
Changes are in this branch: https://github.com/gkronber/Metatheory.jl/tree/225_hash_collision
The unit tests pass. The collision does not occur when called from my code anymore.
The benchmarking script shows no difference.
New code (0fd0c91):
```julia
julia> run(SUITE)
5-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "prop_logic" => 4-element BenchmarkTools.BenchmarkGroup:
    tags: ["egraph", "logic"]
    "freges_theorem" => Trial(1.183 ms)
    "prove1" => Trial(41.253 ms)
    "demorgan" => Trial(111.700 μs)
    "rewrite" => Trial(118.100 μs)
  "basic_maths" => 2-element BenchmarkTools.BenchmarkGroup:
    tags: ["egraphs"]
    "simpl2" => Trial(26.659 ms)
    "simpl1" => Trial(6.358 ms)
  "while_superinterpreter" => 1-element BenchmarkTools.BenchmarkGroup:
    tags: ["egraph"]
    "while_10" => Trial(22.277 ms)
  "egraph" => 2-element BenchmarkTools.BenchmarkGroup:
    tags: ["egraphs"]
    "addexpr" => Trial(5.190 ms)
    "constructor" => Trial(592.614 ns)
  "calc_logic" => 2-element BenchmarkTools.BenchmarkGroup:
    tags: ["egraph", "logic"]
    "freges_theorem" => Trial(47.019 ms)
    "demorgan" => Trial(86.500 μs)
```
Current code (7709c90):
```julia
julia> run(SUITE)
5-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "prop_logic" => 4-element BenchmarkTools.BenchmarkGroup:
    tags: ["egraph", "logic"]
    "freges_theorem" => Trial(1.166 ms)
    "prove1" => Trial(41.580 ms)
    "demorgan" => Trial(107.500 μs)
    "rewrite" => Trial(112.500 μs)
  "basic_maths" => 2-element BenchmarkTools.BenchmarkGroup:
    tags: ["egraphs"]
    "simpl2" => Trial(25.059 ms)
    "simpl1" => Trial(6.043 ms)
  "while_superinterpreter" => 1-element BenchmarkTools.BenchmarkGroup:
    tags: ["egraph"]
    "while_10" => Trial(21.372 ms)
  "egraph" => 2-element BenchmarkTools.BenchmarkGroup:
    tags: ["egraphs"]
    "addexpr" => Trial(4.827 ms)
    "constructor" => Trial(620.994 ns)
  "calc_logic" => 2-element BenchmarkTools.BenchmarkGroup:
    tags: ["egraph", "logic"]
    "freges_theorem" => Trial(44.582 ms)
    "demorgan" => Trial(85.800 μs)
```
Please check the changes. If you're OK with them, I'll create a PR (with all changes squashed into a single commit). Let me know if there are other benchmarks I can run.
Thanks! Yeah please go ahead! That's an amazing fix.
Glad to see it didn't slow things down.