SciML/DiffEqGPU.jl

Recent benchmarks of Kernelized GPU ODE solvers with MPGOS

utkarsh530 opened this issue · 30 comments

Comparison of GPUTsit5 with MPGOS:
I used this script here (The author who compared MPGOS and Julia Stuff earlier) https://discourse.julialang.org/t/performance-problems-with-parallel-ensemble-simulation-on-gpu/34596 (taking minimum of simulation times), and it seems that we are currently faster than MPGOS,in terms of only the raw solve.
Threads = 768000
Batch size = 768000
GPUTsit5:

println("Minimum time: "*string(minimum(data.times)/1e6)*" ms")
Minimum time: 14.594967 ms

MPGOS:

Total simulation time:           16.749ms
Simulation time / 1000 RK4 step: 16.749ms
Ensemble size:                   768000

Here is the log-log plot of simulation times plotted with the number of trajectories.
plot_GPUTsit5_MPGOS_log_log

Caveat:

Some of the issues with using DiffEqGPU API with GPUTsit5 is listed here: #171. So, to extract maximum performance from these solvers, I custom-generated ensemble problems as shown in the script below and call the lower-level API directly.

using OrdinaryDiffEq, DiffEqGPU, BenchmarkTools, StaticArrays, CPUTime, SimpleDiffEq
using CUDA

@show ARGS
#settings
numberOfParameters = 768000
gpuID = 0

device!(CuDevice(gpuID))
println("Running on "*string(CuDevice(gpuID)))

function lorenz(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return @SVector [du1, du2, du3]
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem(lorenz, u0, tspan, p)

parameterList = range(0.0f0,stop = 21.0f0,length=numberOfParameters)

lorenzProblem =  ODEProblem(lorenz,u0,tspan,p)
#parameterChange = (prob,i,repeat) -> remake(prob,p=parameterList[i]) #it has slightly more allocations
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector [parameterList[i], p[2] ,p[3]]))

ensembleProb = EnsembleProblem(lorenzProblem,prob_func = prob_func)

## Building problems here only
I = 1:numberOfParameters
if ensembleProb.safetycopy
    probs = map(I) do i
        ensembleProb.prob_func(deepcopy(ensembleProb.prob), i, 1)
    end
else
    probs = map(I) do i
        ensembleProb.prob_func(ensembleProb.prob, i, 1)
    end
end

## Make them compatible with CUDA
probs = cu(probs)

@info "Solving the problem"
data = @benchmark @CUDA.sync DiffEqGPU.vectorized_solve($probs, $ensembleProb.prob, GPUSimpleTsit5();
save_everystep = false, dt = 0.01f0)


println("Parameter number: "*string(numberOfParameters))
println("Minimum time: "*string(minimum(data.times)/1e6)*" ms")
println("Allocs: "*string(data.allocs))

To be fair, vectorized_solve does really represent the exact solve time because benchmarking should not include conversion other extraneous overheads. DiffEqGPU provides ease for the users to provide generic API compatible with other stuff from DifferentialEquations.jl.

Possible Solution:

Write a completely different API, which will require to build the users to build problems on their own and pass those array of problems to GPUTsit5.

I wonder if the architecture specified in their makefile is why we doing see better MPGOS results on the newer hardware. https://github.com/nnagyd/ode_solver_tests/blob/master/Lorenz_RK4/GPU_MPGOS/makefile

It looks like your TESLA v100 is sm_70.

I am testing on a clean OS with an A100 40 GB GPU and

(@v1.8) pkg> st
Status `~/.julia/environments/v1.8/Project.toml`
  [6e4b80f9] BenchmarkTools v1.3.1
  [a9c8d775] CPUTime v1.0.0
  [052768ef] CUDA v3.12.0
  [071ae1c0] DiffEqGPU v1.20.0
  [1dea7af3] OrdinaryDiffEq v6.29.3
  [05bca326] SimpleDiffEq v1.7.0
  [90137ffa] StaticArrays v1.5.9

and am getting the following during the DiffEqGPU.vectorized_solve

ERROR: InvalidIRError: compiling kernel #tsit5_kernel(CuDeviceVector{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, CuDeviceMatrix{SVector{3, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Int64, Nothing, Val{false}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to apply_discrete_callback!)
Stacktrace:
 [1] step!
   @ ~/.julia/packages/DiffEqGPU/CiiCq/src/perform_step/gpu_tsit5_perform_step.jl:60
 [2] tsit5_kernel
   @ ~/.julia/packages/DiffEqGPU/CiiCq/src/perform_step/gpu_tsit5_perform_step.jl:105
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(DiffEqGPU.tsit5_kernel), Tuple{CuDeviceVector{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, CuDeviceMatrix{SVector{3, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Int64, Nothing, Val{false}}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/6qAH4/src/validation.jl:141
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/6qAH4/src/driver.jl:418 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/4yHI4/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/6qAH4/src/driver.jl:416 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/6qAH4/src/utils.jl:68
  [6] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:354
  [7] #224
    @ ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:347 [inlined]
  [8] JuliaContext(f::CUDA.var"#224#225"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(DiffEqGPU.tsit5_kernel), Tuple{CuDeviceVector{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, CuDeviceMatrix{SVector{3, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Int64, Nothing, Val{false}}}}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/6qAH4/src/driver.jl:76
  [9] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:346
 [10] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/6qAH4/src/cache.jl:90
 [11] cufunction(f::typeof(DiffEqGPU.tsit5_kernel), tt::Type{Tuple{CuDeviceVector{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, CuDeviceMatrix{SVector{3, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Int64, Nothing, Val{false}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:299
 [12] cufunction(f::typeof(DiffEqGPU.tsit5_kernel), tt::Type{Tuple{CuDeviceVector{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, CuDeviceMatrix{SVector{3, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Int64, Nothing, Val{false}}})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:292
 [13] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:102 [inlined]
 [14] vectorized_solve(probs::CuArray{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1, CUDA.Mem.DeviceBuffer}, prob::ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, alg::GPUSimpleTsit5; dt::Float32, saveat::Nothing, save_everystep::Bool, debug::Bool, callback::CallbackSet{Tuple{}, Tuple{}}, tstops::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/CiiCq/src/solve.jl:34
 [15] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/utilities.jl:25 [inlined]
 [16] var"##core#315"(probs#313::CuArray{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1, CUDA.Mem.DeviceBuffer}, ensembleProb#314::EnsembleProblem{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#1#2", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing})
    @ Main ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:489
 [17] var"##sample#316"(::Tuple{CuArray{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1, CUDA.Mem.DeviceBuffer}, EnsembleProblem{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#1#2", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}}, __params::BenchmarkTools.Parameters)
    @ Main ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:495
 [18] _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Pairs{Symbol, Integer, NTuple{4, Symbol}, NamedTuple{(:samples, :evals, :gctrial, :gcsample), Tuple{Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:99
 [19] #invokelatest#2
    @ ./essentials.jl:731 [inlined]
 [20] #run_result#45
    @ ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:34 [inlined]
 [21] run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Pairs{Symbol, Integer, NTuple{5, Symbol}, NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample), Tuple{Bool, Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:117
 [22] #warmup#54
    @ ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:169 [inlined]
 [23] warmup(item::BenchmarkTools.Benchmark)
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:168
 [24] top-level scope
    @ ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:393
 [25] top-level scope
    @ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:52

removing callback = CallbackSet() works though

Yes, please use that. I was checked out on development and hence made that change.

For A100 40GB:

Julia script above

Parameter number: 768000
Minimum time: 9.799917 ms
Allocs: 97

Okay yeah, so that's expected. A100's better than Tesla V100 and hence faster 14.6ms.
Let me generate scripts for plotting + testing against MPGOS, and I'll share them with you.

MPGOS w/ default makefile

Total simulation time:           14.429ms
Simulation time / 1000 RK4 step: 14.429ms
Ensemble size:                   768000

removing the architecture and maxregisters from the MPGOS makefile has virtual no impact on the results

I got similar results too and tried with sm_70 architecture as well. Can you run for 768 Ensemble size to check if we are getting that huge perf increase as shown in the plots?

Where do I set the ensemble for MPGOS?

Julia

Parameter number: 768
Minimum time: 0.196175 ms
Allocs: 46

MPGOS

Total simulation time:           3.115ms
Ensemble size:                   768

Ignore the link above. It seems like they have code for single and coupled systems. However, the MPGOS single system still doesn't match what you are simulating.

They are varying ρ not σ. Its also not clear from the code what their IC and time range is.

Below more explicitly matches the MPGOS ODE.

function lorenz(u, p, t)
    du1 = 10.0f0 * (u[2] - u[1])
    du2 = p*u[1] - u[2] - u[1]*u[3]
    du3 = u[1]*u[2] - 2.666f0 * u[3]
    return @SVector [du1, du2, du3]
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = 28f0
prob = ODEProblem(lorenz, u0, tspan, p)

parameterList = range(0.0f0,stop = 21.0f0,length=numberOfParameters)

lorenzProblem =  ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob, i, repeat) -> remake(prob, p = parameterList[i])

I also added tols to match as well

data = @benchmark @CUDA.sync DiffEqGPU.vectorized_solve($probs, $ensembleProb.prob, GPUSimpleTsit5();
save_everystep = false, dt = 0.01f0, abstol=1f-8, reltol=1f-8)

It doesn't really change the timing luckily

Parameter number: 768000
Minimum time: 9.704811 ms

I am not sure about that actually, but what they are trying to do is change unroll which sets the number of Lorenz problems to solve on a single thread. See here in the Ensemble case, (where they are setting trajectories as numberOfParameters/unroll): https://github.com/nnagyd/ode_solver_tests/blob/master/Lorenz_RK4/CPU_Julia/CPU_Lorenz_julia_ensemble.jl#L60
Hence, what we are simulating is unroll = 1, at least from what I can say in Julia.
However, I still need to understand more about what they are doing.

Yeah. thanks for that. The tolerance doesn't make sense in fix time-stepping, right? Use vectorized_asolve for the adaptive version.

I was not sure why they say Simulation time / 1000 RK4 step. It confused me to use adaptive = false case.

Why do you say fixed time? The stats spit out by MPGOS show the min/max time steps as

   Maximum time step:                         1e+06
   Minimum time step:                         1e-12

Because when they were using EnsembleGPUArray from Julia DiffEqGPU they were using adaptive = false https://github.com/nnagyd/ode_solver_tests/blob/master/Lorenz_RK4/GPU_Julia/GPU_Lorenz_simple_julia.jl#L77

Its hard to tell what they were doing with Julia. However since we are benchmarking against https://github.com/nnagyd/ode_solver_tests/tree/master/Lorenz_RK4/GPU_MPGOS we should try to match that.

changing to vectorized_asolve gives

data = @benchmark @CUDA.sync DiffEqGPU.vectorized_asolve($probs, $ensembleProb.prob, GPUSimpleTsit5();
       save_everystep = false, dt = 0.01f0, abstol=1f-8, reltol=1f-8)
ERROR: MethodError: no method matching vectorized_asolve(::CuArray{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, Float32, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1, CUDA.Mem.DeviceBuffer}, ::ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, Float32, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, ::GPUSimpleTsit5; save_everystep=false, dt=0.01f0, abstol=1.0f-8, reltol=1.0f-8)
Closest candidates are:
  vectorized_asolve(::Any, ::ODEProblem, ::GPUSimpleATsit5; dt, saveat, save_everystep, abstol, reltol, debug, callback, tstops, kwargs...) at ~/.julia/packages/DiffEqGPU/CiiCq/src/solve.jl:57
Stacktrace:
  [1] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/utilities.jl:25 [inlined]
  [2] var"##core#397"(probs#395::CuArray{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, Float32, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1, CUDA.Mem.DeviceBuffer}, ensembleProb#396::EnsembleProblem{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, Float32, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#25#26", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing})
    @ Main ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:489
  [3] var"##sample#398"(::Tuple{CuArray{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, Float32, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1, CUDA.Mem.DeviceBuffer}, EnsembleProblem{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, Float32, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#25#26", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}}, __params::BenchmarkTools.Parameters)
    @ Main ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:495
  [4] _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Pairs{Symbol, Integer, NTuple{4, Symbol}, NamedTuple{(:samples, :evals, :gctrial, :gcsample), Tuple{Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:99
  [5] #invokelatest#2
    @ ./essentials.jl:731 [inlined]
  [6] #run_result#45
    @ ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:34 [inlined]
  [7] run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Pairs{Symbol, Integer, NTuple{5, Symbol}, NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample), Tuple{Bool, Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:117
  [8] #warmup#54
    @ ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:169 [inlined]
  [9] warmup(item::BenchmarkTools.Benchmark)
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:168
 [10] top-level scope
    @ ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:393

ah... needed GPUSimpleATsit5

julia> data = @benchmark @CUDA.sync DiffEqGPU.vectorized_solve($probs, $ensembleProb.prob, GPUSimpleTsit5();
           save_everystep = false, dt = 0.01f0, abstol=1f-8, reltol=1f-8)
BenchmarkTools.Trial: 507 samples with 1 evaluation.
 Range (min  max):  9.786 ms   23.002 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     9.808 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.858 ms ± 649.728 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▅▃█▇▄▃▁▄▃ ▂▁                                               
  ▅▇████████████▇▆▆▅▅▆▂▄▄▃▆▅▅▆▃▃▄▃▃▃▁▄▃▄▁▂▂▁▂▃▁▃▁▁▂▁▁▁▁▂▁▁▂▁▂ ▄
  9.79 ms         Histogram: frequency by time        9.92 ms <

 Memory estimate: 4.94 KiB, allocs estimate: 97.

julia> data = @benchmark @CUDA.sync DiffEqGPU.vectorized_asolve($probs, $ensembleProb.prob, GPUSimpleATsit5();
           save_everystep = false, dt = 0.01f0, abstol=1f-8, reltol=1f-8)
BenchmarkTools.Trial: 186 samples with 1 evaluation.
 Range (min  max):  26.867 ms  27.191 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     26.899 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.908 ms ± 44.145 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▄▄▆▅█▁▁▂                                                   
  ▃▃████████▇▅▅▅▄▃▃▃▃▃▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃ ▃
  26.9 ms         Histogram: frequency by time        27.2 ms <

 Memory estimate: 4.69 KiB, allocs estimate: 89.

Thanks for the results.
Okay, I think I got it. They are using two algorithms RK4 and RKCK45. The RK4 is the adaptive = false version, see here: https://github.com/nnagyd/ode_solver_tests/blob/master/Lorenz_RK4/GPU_MPGOS/SourceCodes/SingleSystem_PerThread_ExplicitRungeKutta_ErrorControllers.cuh#L6-L20. And they define to use RK4 here: https://github.com/nnagyd/ode_solver_tests/blob/master/Lorenz_RK4/GPU_MPGOS/Lorenz.cu#L15. Hence, I am pretty sure they are using adaptive = false case.
They printing tolerance with RK4 which is still very weird.

I feel there might be something different with our adaptive time-stepping as RKCK45 is faster than RK4.

I made some changes within the script: MPGOS is integrating till 1.0s. I am quite sure about adaptivity that RK4 is fixed-time step and RKCK45 is adaptive time-stepping. In any case, we are doing better now:

using DiffEqGPU, BenchmarkTools, StaticArrays, SimpleDiffEq
using CUDA

@show ARGS
#settings

numberOfParameters = isinteractive() ? 768000 : parse(Int64, ARGS[1])
gpuID = 0

device!(CuDevice(gpuID))
println("Running on "*string(CuDevice(gpuID)))

function lorenz(u, p, t)
    du1 = 10.0f0 * (u[2] - u[1])
    du2 = p*u[1] - u[2] - u[1]*u[3]
    du3 = u[1]*u[2] - 2.666f0 * u[3]
    return @SVector [du1, du2, du3]
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 1.0f0)
p = 28f0
prob = ODEProblem(lorenz, u0, tspan, p)

parameterList = range(0.0f0,stop = 21.0f0,length=numberOfParameters)

lorenzProblem =  ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob, i, repeat) -> remake(prob, p = parameterList[i])

ensembleProb = EnsembleProblem(lorenzProblem,prob_func = prob_func)

## Building problems here only
I = 1:numberOfParameters
if ensembleProb.safetycopy
    probs = map(I) do i
        ensembleProb.prob_func(deepcopy(ensembleProb.prob), i, 1)
    end
else
    probs = map(I) do i
        ensembleProb.prob_func(ensembleProb.prob, i, 1)
    end
end

## Make them compatible with CUDA
probs = cu(probs)

@info "Solving the problem"
data = @benchmark @CUDA.sync DiffEqGPU.vectorized_solve($probs, $ensembleProb.prob, GPUSimpleTsit5();
save_everystep = false, dt = 0.001f0)



println("Parameter number: "*string(numberOfParameters))
println("Minimum time: "*string(minimum(data.times)/1e6)*" ms")
println("Allocs: "*string(data.allocs))


data = @benchmark @CUDA.sync DiffEqGPU.vectorized_asolve($probs, $ensembleProb.prob, GPUSimpleATsit5();
dt = 0.001f0,reltol = 1f-8,abstol=1f-8)


println("Parameter number: "*string(numberOfParameters))
println("Minimum time: "*string(minimum(data.times)/1e6)*" ms")
println("Allocs: "*string(data.allocs))

julia> data = @benchmark @CUDA.sync DiffEqGPU.vectorized_solve($probs, $ensembleProb.prob, GPUSimpleTsit5();
       save_everystep = false, dt = 0.001f0)
BenchmarkTools.Trial: 295 samples with 1 evaluation.
 Range (min … max):  14.925 ms … 36.507 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     15.416 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.955 ms ±  4.546 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█▅
  ████▇▅▅█▄▄▄▁▄▁▁▁▄▁▄▄▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▄▁▄▅▆▅▅▆▄▁▄▁▁▄▁▄▁▄▅▁▁▄ ▅
  14.9 ms      Histogram: log(frequency) by time      35.2 ms <

 Memory estimate: 5.16 KiB, allocs estimate: 107.

julia> println("Parameter number: "*string(numberOfParameters))
Parameter number: 768000

julia> println("Minimum time: "*string(minimum(data.times)/1e6)*" ms")
Minimum time: 14.925271 ms

julia> println("Allocs: "*string(data.allocs))
Allocs: 107

julia> data = @benchmark @CUDA.sync DiffEqGPU.vectorized_asolve($probs, $ensembleProb.prob, GPUSimpleATsit5();
       dt = 0.001f0,reltol = 1f-8,abstol=1f-8)
BenchmarkTools.Trial: 560 samples with 1 evaluation.
 Range (min … max):  7.059 ms … 42.164 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.723 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.921 ms ±  4.485 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▆▂
  ████▇▅▇▅▅▁▄▄▄▄▅▁▅▄▁▄▄▁▁▁▁▅▁▁▁▁▁▁▁▆▆▁▅▁▁▄▄▁▁▁▁▄▄▁▄▄▄▄▁▅▁▄▄▄ ▆
  7.06 ms      Histogram: log(frequency) by time     29.1 ms <

 Memory estimate: 4.80 KiB, allocs estimate: 97.

RKCK45 is a Cash Karp method most comparable to Tsit5, not RK4

is minimum time the right measurement here? IMO throughout matters a lot more when you're on a gpu