SciML/DiffEqGPU.jl

GPU [EnsembleGPUAutonomous] has no performance advantage over CPU [EnsembleThreads]?

TheStarAlight opened this issue · 22 comments

Hello everyone! I'm working on a program that uses ensemble simulation. I noticed that in issue #141 @ChrisRackauckas mentioned that a kernel version of Tsit5 would be ~100x faster. Today I updated DiffEqGPU to v1.17 and used the EnsembleGPUAutonomous/GPUTsit5 options to run a benchmark like the tests do:

using DiffEqGPU, OrdinaryDiffEq, CUDA, StaticArrays, BenchmarkTools
function lorenz(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return @SVector [du1, du2, du3]
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem(lorenz, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p=p)
monteprob = EnsembleProblem(prob, prob_func=prob_func, safetycopy=false)
# run once to compile.
CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUAutonomous(), trajectories=10000, adaptive=false, dt=0.1f0)
     @time sol = solve(monteprob,    Tsit5(), EnsembleThreads(),       trajectories=10000, adaptive=false, dt=0.1f0)
@info "GPU Kernel"
bg = @benchmark sol = solve(monteprob, GPUTsit5(), EnsembleGPUAutonomous(), trajectories=10000, adaptive=false, dt=0.1f0)
display(bg)
@info "CPU Parallel"
bc = @benchmark sol = solve(monteprob,    Tsit5(), EnsembleThreads(),       trajectories=10000, adaptive=false, dt=0.1f0)
display(bc)

However, the result is still not satisfactory:

[ Info: GPU Kernel
BenchmarkTools.Trial: 10 samples with 1 evaluation.
Range (min … max): 428.174 ms … 598.739 ms ┊ GC (min … max): 13.16% … 18.42%
Time (median): 527.766 ms ┊ GC (median): 16.21%
Time (mean ± σ): 524.246 ms ± 49.242 ms ┊ GC (mean ± σ): 15.93% ± 2.46%
Memory estimate: 802.14 MiB, allocs estimate: 940165.

[ Info: CPU Parallel
BenchmarkTools.Trial: 14 samples with 1 evaluation.
Range (min … max): 261.182 ms … 641.897 ms ┊ GC (min … max): 0.00% … 59.71%
Time (median): 376.021 ms ┊ GC (median): 13.84%
Time (mean ± σ): 372.574 ms ± 92.912 ms ┊ GC (mean ± σ): 17.98% ± 17.36%
Memory estimate: 189.50 MiB, allocs estimate: 1288999.

I wonder why the acceleration is still not obvious in the kernel version?

Thanks, there's still more optimizations to be added but we can use this as a benchmark target.

I see in the source code that the algorithm uses one GPU thread to solve a single trajectory, and I wonder whether this scheme really accelerates the solve. In my opinion, solving an entire trajectory still seems too complicated a job for a single GPU thread, because in GPU matrix operations each thread does a very simple job and the parallelism is fully utilized. Would it be possible to transform the procedure of solving ensemble ODE problems into basic matrix operations, so that we can make full use of the GPU parallelism?
I'm a novice and that's just my guess. I hope it helps bring some new ideas for optimization :D
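As a rough sketch of what I mean by "basic matrix operations" (illustrative only, and not how DiffEqGPU actually implements anything), a fixed-step Euler ensemble could batch all trajectories into a single state matrix and advance them with broadcasts:

```julia
# Illustrative sketch: batch N trajectories into a 3×N matrix and step
# them together with broadcast operations, so each GPU thread does one
# simple elementwise job. Works with a plain Array on the CPU or a
# CuArray on the GPU. A smaller dt is used here since plain Euler is
# far less stable than Tsit5.
function euler_lorenz_step!(u, p, dt)
    σ, ρ, β = p
    x = @view u[1, :]
    y = @view u[2, :]
    z = @view u[3, :]
    # Lorenz right-hand side for every trajectory at once
    dx = σ .* (y .- x)
    dy = x .* (ρ .- z) .- y
    dz = x .* y .- β .* z
    x .+= dt .* dx
    y .+= dt .* dy
    z .+= dt .* dz
    return u
end

u = zeros(Float32, 3, 10_000)   # replace with a CuArray for GPU execution
u[1, :] .= 1.0f0                # initial condition (1, 0, 0) for each trajectory
for _ in 1:100
    euler_lorenz_step!(u, (10.0f0, 28.0f0, 8 / 3.0f0), 0.01f0)
end
```

Each broadcast launches only an O(N) elementwise kernel, though, so utilization might still be low for a small system like this.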

#170 (comment)

It's at least already 31x faster than ensembling into basic matrix operations like EnsembleGPUArray does. But you can also start profiling EnsembleGPUArray and look for anything we may have missed: this route seems to have been a lot more fruitful at filling the GPU kernels though. The problem with doing it at the matrix operations is just that you get low utilization because the kernels are not large enough. They aren't matmuls, but O(n) broadcast types of expressions.

But indeed there are some issues with the per-kernel approach. I think we need to make sure that it's splitting per-warp instead of doing it all together; that way the step count is the maximum per warp instead of over the whole batch, which could be a significant difference.

Thanks for the reply. Would you like to try this example? I ran the test on my laptop (i5-9300H & GTX 1650, 3 years old) and found that GPUTsit5 didn't accelerate at all. Should I add save_everystep=false? I tried it earlier but an error was thrown and I couldn't figure out why (I'm just very bad at reading the code...).

using DiffEqGPU, OrdinaryDiffEq, CUDA, StaticArrays, BenchmarkTools
function lorenz(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return @SVector [du1, du2, du3]
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem(lorenz, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p=p)
monteprob = EnsembleProblem(prob, prob_func=prob_func, safetycopy=false)
# run once to compile.
CUDA.@time sol = solve(monteprob,    Tsit5(), EnsembleThreads(),       trajectories=10000, adaptive=false, dt=0.1f0)
CUDA.@time sol = solve(monteprob,    Tsit5(), EnsembleCPUArray(),      trajectories=10000, adaptive=false, dt=0.1f0)
CUDA.@time sol = solve(monteprob,    Tsit5(), EnsembleGPUArray(),      trajectories=10000, adaptive=false, dt=0.1f0)
CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUAutonomous(), trajectories=10000, adaptive=false, dt=0.1f0)
@info "Testing..."
b1 = @benchmark sol = solve(monteprob,    Tsit5(), EnsembleThreads(),       trajectories=10000, adaptive=false, dt=0.1f0)
b2 = @benchmark sol = solve(monteprob,    Tsit5(), EnsembleCPUArray(),      trajectories=10000, adaptive=false, dt=0.1f0)
b3 = @benchmark sol = solve(monteprob,    Tsit5(), EnsembleGPUArray(),      trajectories=10000, adaptive=false, dt=0.1f0)
b4 = @benchmark sol = solve(monteprob, GPUTsit5(), EnsembleGPUAutonomous(), trajectories=10000, adaptive=false, dt=0.1f0, save_everystep=false)
@info "EnsembleThreads"
display(b1)
@info "EnsembleCPUArray"
display(b2)
@info "EnsembleGPUArray"
display(b3)
@info "EnsembleGPUAutonomous"
display(b4)

The mean times respectively:
189.301 ms
855.899 ms
695.308 ms
418.544 ms
Just very weird.

BTW, why don't you use CUDA.@time for time measurement? CUDA.jl suggests using its macros to measure the time of GPU tasks: https://cuda.juliagpu.org/stable/development/profiling/
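For reference, GPU launches are asynchronous, so a host-side timer can return before the kernels actually finish. A minimal sketch following the CUDA.jl profiling docs (the matmul is just a stand-in workload):

```julia
using CUDA, BenchmarkTools

A = CUDA.rand(1000, 1000)

# CUDA.@time synchronizes before stopping the clock and also reports
# GPU-side allocations:
CUDA.@time A * A

# With BenchmarkTools, wrap the expression in CUDA.@sync so the
# measurement covers the asynchronous kernel execution:
@benchmark CUDA.@sync $A * $A
```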

It's really strange that you got such a large speedup while my laptop showed no acceleration at all. I think there must be some mistake in my benchmark code. May I have your benchmark code? I would like to give it a test :D

Maybe it's GPU-dependent? What kind of GPU?

oh 1650, maybe because it's an old slow GPU? Smaller registers / more memory pressure hits a bad case?

The GTX 1650 is not so bad, is it? It has 4 GB VRAM, and 10,000 trajectories should be fine. Anyway, GPU-based CUFFT is ~30x faster on my laptop than CPU-based FFTW; it should also be the case here.

Switching the trajectory number to 100 still gives the same result: no faster than EnsembleThreads.

Hi @TheStarAlight, thanks for all your comments and feedback. It seems to me that you're not on the latest master of DiffEqGPU. EnsembleGPUAutonomous was renamed to EnsembleGPUKernel, along with some optimizations based on feedback. #170

To address your question: the solvers themselves are really fast; even building the SciMLSolution is very slow compared to them.
The raw solves are on the order of ~26-50 μs (just ts, us as CuArrays), without exporting them to a SciMLSolution and converting them to CPU arrays. Those overheads are huge. We are working on that.
#171

I ran your code just now (try this):

using DiffEqGPU, OrdinaryDiffEq, CUDA, StaticArrays, BenchmarkTools
function lorenz(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return @SVector [du1, du2, du3]
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem(lorenz, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p=p)
monteprob = EnsembleProblem(prob, prob_func=prob_func, safetycopy=false)
# run once to compile.
CUDA.@time sol = solve(monteprob, Tsit5(), EnsembleThreads(), trajectories=10000, adaptive=false, dt=0.1f0)
CUDA.@time sol = solve(monteprob, Tsit5(), EnsembleCPUArray(), trajectories=10000, adaptive=false, dt=0.1f0)
CUDA.@time sol = solve(monteprob, Tsit5(), EnsembleGPUArray(), trajectories=10000, adaptive=false, dt=0.1f0)
CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories=10000, adaptive=false, dt=0.1f0)
@info "Testing..."
b1 = @benchmark sol = solve(monteprob, Tsit5(), EnsembleThreads(), trajectories=10000, adaptive=false, dt=0.1f0)
#27.393 ms
b2 = @benchmark sol = solve(monteprob, Tsit5(), EnsembleCPUArray(), trajectories=10000, adaptive=false, dt=0.1f0)
#860.229 ms
b3 = @benchmark @CUDA.sync sol = solve(monteprob, Tsit5(), EnsembleGPUArray(), trajectories=10000, adaptive=false, dt=0.1f0)
#206.077 ms
b4 = @benchmark @CUDA.sync sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories=10000, adaptive=false, dt=0.1f0)
#7.431 ms
@info "EnsembleThreads"
display(b1)
@info "EnsembleCPUArray"
display(b2)
@info "EnsembleGPUArray"
display(b3)
@info "EnsembleGPUKernel"
display(b4)

You can use EnsembleGPUKernel(0.0) to offload everything to the GPU. The constant refers to how much of the work you want to offload to the CPU. https://github.com/SciML/DiffEqGPU.jl/blob/master/src/DiffEqGPU.jl#L166
I am on NVIDIA A100 80GB PCIe, which is definitely a better GPU. I could run this code on a laptop GPU.
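For instance (the 0.5 split below is purely illustrative):

```julia
# Offload everything to the GPU (CPU fraction = 0.0):
sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(0.0),
            trajectories = 10_000, adaptive = false, dt = 0.1f0)

# Or split the batch, e.g. solve half of the trajectories on the CPU:
sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(0.5),
            trajectories = 10_000, adaptive = false, dt = 0.1f0)
```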

Oh, it seems that the latest master hasn't been tagged yet. @ChrisRackauckas

Thank you very much @ChrisRackauckas! I see you renamed Autonomous to Kernel, but the changes have not been released yet; it's still v1.17. I'm trying your latest code now.

@utkarsh530 and thank you as well! :D

I'm sorry... I'm new to Julia and don't know how to use your master code without installing the package... It seems that I would have to install a lot of dependency packages. May I test your optimized code after you publish v1.18? I'll keep watching :D

You can do ] dev DiffEqGPU
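In full, this clones the repository into ~/.julia/dev and tracks its current master instead of the registered release:

```julia
# In the Julia REPL, press `]` to enter Pkg mode:
#   pkg> dev DiffEqGPU
# or, equivalently, from ordinary Julia code:
using Pkg
Pkg.develop("DiffEqGPU")   # clone to ~/.julia/dev/DiffEqGPU and use it
# To return to the registered release later:
# Pkg.free("DiffEqGPU")
```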

Thank you! @utkarsh530 @ChrisRackauckas I tested it and found the current version [071ae1c0] much faster!
EnsembleThreads: 199.472 ms ± 223.464 ms
EnsembleCPUArray: 1.089 s ± 109.262 ms
EnsembleGPUArray: 887.967 ms ± 186.571 ms
EnsembleGPUKernel: 23.962 ms ± 11.475 ms
To sum up, EnsembleGPUKernel is ~40x faster than CPUArray/GPUArray and ~8x faster than EnsembleThreads!
Thank you very much, I think now I can implement GPU option in my program :D
Looking forward to its release!

Hi @utkarsh530! I see your benchmark at https://github.com/SciML/DiffEqGPU.jl/pull/170#issuecomment-1201483078, where you used save_everystep=false. This matches my case because I only need the final state of the particles. However, on the master branch it is not available: a MethodError is thrown due to a keyword mismatch. I tracked the stack and found that it points to another package, SimpleDiffEq, whose gpuatsit5.jl: solve doesn't support the save_everystep argument. Even setting it to true raises an exception.
How can I enable the save_everystep=false option? Thanks very much!

julia> sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories=10000, adaptive=false, dt=0.1f0, saveat=nothing, save_everystep=false)
ERROR: TaskFailedException
Stacktrace:
 [1] wait
   @ .\task.jl:334 [inlined]
 [2] __solve(ensembleprob::EnsembleProblem{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#1#2", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::GPUTsit5, ensemblealg::EnsembleGPUKernel; trajectories::Int64, batch_size::Int64, unstable_check::Function, adaptive::Bool, kwargs::Base.Pairs{Symbol, Union{Nothing, Real}, Tuple{Symbol, Symbol, Symbol}, NamedTuple{(:dt, :saveat, :save_everystep), Tuple{Float32, Nothing, Bool}}})
   @ DiffEqGPU C:\Users\ZMY\.julia\dev\DiffEqGPU\src\DiffEqGPU.jl:235
 [3] #solve#35
   @ C:\Users\ZMY\.julia\packages\DiffEqBase\72SnT\src\solve.jl:818 [inlined]
 [4] top-level scope
   @ REPL[14]:1
 [5] top-level scope
   @ C:\Users\ZMY\.julia\packages\CUDA\DfvRa\src\initialization.jl:52

    nested task error: TaskFailedException
    Stacktrace:
     [1] wait
       @ .\task.jl:334 [inlined]
     [2] threading_run(func::Function)
       @ Base.Threads .\threadingconstructs.jl:38
     [3] macro expansion
       @ .\threadingconstructs.jl:97 [inlined]
     [4] tmap(f::Function, args::UnitRange{Int64})
       @ SciMLBase C:\Users\ZMY\.julia\packages\SciMLBase\cheTp\src\ensemble\basic_ensemble_solve.jl:173
     [5] #solve_batch#505
       @ C:\Users\ZMY\.julia\packages\SciMLBase\cheTp\src\ensemble\basic_ensemble_solve.jl:164 [inlined]
     [6] f
       @ C:\Users\ZMY\.julia\dev\DiffEqGPU\src\DiffEqGPU.jl:221 [inlined]
     [7] macro expansion
       @ C:\Users\ZMY\.julia\dev\DiffEqGPU\src\DiffEqGPU.jl:226 [inlined]
     [8] (::DiffEqGPU.var"#8#14"{DiffEqGPU.var"#f#13"{Base.Pairs{Symbol, Union{Nothing, Real}, Tuple{Symbol, Symbol, Symbol}, NamedTuple{(:dt, :saveat, :save_everystep), Tuple{Float32, Nothing, Bool}}}, EnsembleProblem{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#1#2", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, SimpleDiffEq.GPUSimpleTsit5, UnitRange{Int64}}})()
       @ DiffEqGPU .\task.jl:123

        nested task error: MethodError: no method matching solve(::ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, ::SimpleDiffEq.GPUSimpleTsit5; dt=0.1f0, saveat=nothing, save_everystep=false)
        Closest candidates are:
          solve(::ODEProblem, ::SimpleDiffEq.GPUSimpleTsit5; dt) at C:\Users\ZMY\.julia\packages\SimpleDiffEq\vIpeG\src\tsit5\gpuatsit5.jl:9 got unsupported keyword arguments "saveat", "save_everystep"
          solve(::SciMLBase.AbstractDEProblem, ::Any...; sensealg, u0, p, kwargs...) at C:\Users\ZMY\.julia\packages\DiffEqBase\72SnT\src\solve.jl:765
          solve(::Any...; kwargs...) at C:\Users\ZMY\.julia\packages\CommonSolve\TGRvG\src\CommonSolve.jl:23
          ...
        Stacktrace:
         [1] kwerr(::NamedTuple{(:dt, :saveat, :save_everystep), Tuple{Float32, Nothing, Bool}}, ::Function, ::ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, ::SimpleDiffEq.GPUSimpleTsit5)
           @ Base .\error.jl:163
         [2] batch_func(i::Int64, prob::EnsembleProblem{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#1#2", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::SimpleDiffEq.GPUSimpleTsit5; kwargs::Base.Pairs{Symbol, Union{Nothing, Real}, Tuple{Symbol, Symbol, Symbol}, NamedTuple{(:dt, :saveat, :save_everystep), Tuple{Float32, Nothing, Bool}}})
           @ SciMLBase C:\Users\ZMY\.julia\packages\SciMLBase\cheTp\src\ensemble\basic_ensemble_solve.jl:92
         [3] (::SciMLBase.var"#507#509"{Base.Pairs{Symbol, Union{Nothing, Real}, Tuple{Symbol, Symbol, Symbol}, NamedTuple{(:dt, :saveat, :save_everystep), Tuple{Float32, Nothing, Bool}}}, EnsembleProblem{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#1#2", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, SimpleDiffEq.GPUSimpleTsit5})(i::Int64)
           @ SciMLBase C:\Users\ZMY\.julia\packages\SciMLBase\cheTp\src\ensemble\basic_ensemble_solve.jl:165
         [4] macro expansion
           @ C:\Users\ZMY\.julia\packages\SciMLBase\cheTp\src\ensemble\basic_ensemble_solve.jl:174 [inlined]
         [5] (::SciMLBase.var"#400#threadsfor_fun#510"{SciMLBase.var"#507#509"{Base.Pairs{Symbol, Union{Nothing, Real}, Tuple{Symbol, Symbol, Symbol}, NamedTuple{(:dt, :saveat, :save_everystep), Tuple{Float32, Nothing, Bool}}}, EnsembleProblem{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#1#2", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, SimpleDiffEq.GPUSimpleTsit5}, Tuple{UnitRange{Int64}}, Vector{Union{}}, UnitRange{Int64}})(onethread::Bool)
           @ SciMLBase .\threadingconstructs.jl:85
         [6] (::SciMLBase.var"#400#threadsfor_fun#510"{SciMLBase.var"#507#509"{Base.Pairs{Symbol, Union{Nothing, Real}, Tuple{Symbol, Symbol, Symbol}, NamedTuple{(:dt, :saveat, :save_everystep), Tuple{Float32, Nothing, Bool}}}, EnsembleProblem{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#1#2", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, SimpleDiffEq.GPUSimpleTsit5}, Tuple{UnitRange{Int64}}, Vector{Union{}}, UnitRange{Int64}})()
           @ SciMLBase .\threadingconstructs.jl:52

Everything should be released now.

@TheStarAlight I suggest you use adaptive = true with save_everystep = false. That’s faster and adheres to your tolerances.
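Concretely, something like the call below (the tolerances are illustrative, and this assumes a version where EnsembleGPUKernel accepts these keywords):

```julia
# Adaptive stepping with minimal saving; dt is the initial step size here.
sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(),
            trajectories = 10_000, adaptive = true, dt = 0.1f0,
            save_everystep = false, abstol = 1f-6, reltol = 1f-3)
```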

Thank you! I saw this update and I'm trying to implement the GPU option in my program. But I encountered some problems with the parameter support (it seems that adaptive = true with save_everystep = false throws an error on v1.18). Since it has nothing to do with this issue, I shall open a new one.