JuliaSymbolics/Metatheory.jl

Race-condition when using Metatheory from multiple threads

Opened this issue · 3 comments

I'm getting errors when calling saturate for an egraph from multiple threads.
I'm careful not to use any shared variables. The race condition seems to occur inside Metatheory (ale/3.0 branch).

Here is the test script

using Metatheory

function test_threads()
    println("thread pool size: $(Threads.threadpoolsize())")
    Threads.@threads for i in 1:1000
        theory = @theory a b c begin
            a + b == b + a
            a * b == b * a
            a + (b + c) == (a + b) + c
            a * (b * c) == (a * b) * c
        end

        g = EGraph{Expr}(:(1 + (2 + (3 + (4 + (5 + 6))))));
        params = SaturationParams(timeout=100)
        report = saturate!(g, theory, params)
        # println(g)
        # println(report)
    end
end

Calling this from a julia instance with 12 threads julia -t 12: leads to the following output (example)

julia> test_threads()
thread pool size: 12
ERROR: TaskFailedException

    nested task error: WARNING: both TermInterface and Meta export "isexpr"; uses of it in module Metatheory must be qualified
BoundsError: attempt to access 81-element Vector{UInt64} at index [247]
    Stacktrace:
      [1] getindex
        @ ./essentials.jl:13 [inlined]
      [2] getindex
        @ ./abstractarray.jl:1291 [inlined]
      [3] find
        @ ~/.julia/dev/Metatheory/src/EGraphs/unionfind.jl:21 [inlined]
      [4] find
        @ ~/.julia/dev/Metatheory/src/EGraphs/egraph.jl:226 [inlined]
      [5] canonicalize!(g::EGraph{Expr, Nothing}, n::Vector{UInt64})
        @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/egraph.jl:250
      [6] add!(g::EGraph{Expr, Nothing}, n::Vector{UInt64}, should_copy::Bool)
        @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/egraph.jl:279
      [7] instantiate_enode!(bindings::SubArray{UInt128, 1, Vector{…}, Tuple{…}, true}, g::EGraph{Expr, Nothing}, p::PatExpr)
        @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/saturation.jl:133
      [8] apply_rule!(bindings::SubArray{…}, g::EGraph{…}, rule::EqualityRule, id::UInt64, direction::Int64, merges_buffer::OptBuffer{…})
        @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/saturation.jl:145
      [9] eqsat_apply!(g::EGraph{…}, theory::Vector{…}, rep::Metatheory.EGraphs.SaturationReport, params::SaturationParams, ematch_buffer::OptBuffer{…}, merges_buffer::OptBuffer{…})
        @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/saturation.jl:241
     [10] macro expansion
        @ ~/.julia/packages/TimerOutputs/Lw5SP/src/TimerOutput.jl:237 [inlined]
     [11] eqsat_step!(g::EGraph{…}, theory::Vector{…}, curr_iter::Int64, scheduler::Metatheory.EGraphs.Schedulers.BackoffScheduler, params::SaturationParams, report::Metatheory.EGraphs.SaturationReport, ematch_buffer::OptBuffer{…}, merges_buffer::OptBuffer{…})
        @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/saturation.jl:299
     [12] saturate!(g::EGraph{Expr, Nothing}, theory::Vector{RewriteRule}, params::SaturationParams)
        @ Metatheory.EGraphs ~/.julia/dev/Metatheory/src/EGraphs/saturation.jl:336
     [13] macro expansion
        @ ~/cuda/test_parallel_eqsat.jl:17 [inlined]
     [14] (::var"#10#threadsfor_fun#2"{var"#10#threadsfor_fun#1#3"{UnitRange{Int64}}})(tid::Int64; onethread::Bool)
        @ Main ./threadingconstructs.jl:214
     [15] #10#threadsfor_fun
        @ Main ./threadingconstructs.jl:181 [inlined]
     [16] (::Base.Threads.var"#1#2"{var"#10#threadsfor_fun#2"{var"#10#threadsfor_fun#1#3"{UnitRange{Int64}}}, Int64})()
        @ Base.Threads ./threadingconstructs.jl:153

...and 10 more exceptions.

Stacktrace:
 [1] threading_run(fun::var"#10#threadsfor_fun#2"{var"#10#threadsfor_fun#1#3"{UnitRange{Int64}}}, static::Bool)
   @ Base.Threads ./threadingconstructs.jl:171
 [2] macro expansion
   @ Main ./threadingconstructs.jl:219 [inlined]
 [3] test_threads()
   @ Main ~/cuda/test_parallel_eqsat.jl:7
 [4] top-level scope
   @ REPL[2]:1
Some type information was truncated. Use `show(err)` to see complete types.

With a single thread there is no error.

What branch are you on? Are you on latest MT ale/3.0? We stopped using global buffers. The thing is, if you saturate a different e-graph object per-thread, this shouldn't happen and it's indeed a bug.

If you instead are trying to call saturate! on the same e-graph but from different threads, there's no mechanism for parallel saturation, and it's a hard task. Searching is fine, but applying matches must happen sequentially and after you union e-classes, ids are going to change, so if you change the ordering you also change the behavior.

I pulled the latest changes from ale/3.0 but the problem persists. Am I missing something?

(@v1.10) pkg> status Metatheory
Status `~/.julia/environments/v1.10/Project.toml`
  [e9d8d322] Metatheory v3.0.0 `~/.julia/dev/Metatheory`

shell> cd ~/.julia/dev/Metatheory/
/home/gkronber/.julia/dev/Metatheory

shell> git status
On branch ale/3.0
Your branch is up to date with 'origin/ale/3.0'.

nothing to commit, working tree clean

shell> git log
commit 7709c902afbfd880405ec0c0916eee95c2e6d662 (HEAD -> ale/3.0, origin/ale/3.0)
Author: Alessandro Cheli <a@a.a>
Date:   Fri Jun 28 12:14:14 2024 +0200

    remove merges buffer

Yes, I only need the first case: saturate independent egraph objects on parallel threads.

I pulled the latest changes from ale/3.0 but the problem persists. Am I missing something?

(@v1.10) pkg> status Metatheory
Status `~/.julia/environments/v1.10/Project.toml`
  [e9d8d322] Metatheory v3.0.0 `~/.julia/dev/Metatheory`

shell> cd ~/.julia/dev/Metatheory/
/home/gkronber/.julia/dev/Metatheory

shell> git status
On branch ale/3.0
Your branch is up to date with 'origin/ale/3.0'.

nothing to commit, working tree clean

shell> git log
commit 7709c902afbfd880405ec0c0916eee95c2e6d662 (HEAD -> ale/3.0, origin/ale/3.0)
Author: Alessandro Cheli <a@a.a>
Date:   Fri Jun 28 12:14:14 2024 +0200

    remove merges buffer

Yes, I only need the first case: saturate independent egraph objects on parallel threads.

I will investigate further. I think it's because each rule has its own custom callstack