Recent benchmarks of Kernelized GPU ODE solvers with MPGOS

Question

Recent benchmarks of Kernelized GPU ODE solvers with MPGOS

utkarsh530 opened this issue 2 years ago · 30 comments

Comparison of GPUTsit5 with MPGOS:
I used this script here (The author who compared MPGOS and Julia Stuff earlier) https://discourse.julialang.org/t/performance-problems-with-parallel-ensemble-simulation-on-gpu/34596 (taking minimum of simulation times), and it seems that we are currently faster than MPGOS,in terms of only the raw solve.
Threads = 768000
Batch size = 768000
GPUTsit5:

println("Minimum time: "*string(minimum(data.times)/1e6)*" ms")
Minimum time: 14.594967 ms

MPGOS:

Total simulation time:           16.749ms
Simulation time / 1000 RK4 step: 16.749ms
Ensemble size:                   768000

Here is the log-log plot of simulation times plotted with the number of trajectories.

Caveat:

Some of the issues with using DiffEqGPU API with GPUTsit5 is listed here: #171. So, to extract maximum performance from these solvers, I custom-generated ensemble problems as shown in the script below and call the lower-level API directly.

using OrdinaryDiffEq, DiffEqGPU, BenchmarkTools, StaticArrays, CPUTime, SimpleDiffEq
using CUDA

@show ARGS
#settings
numberOfParameters = 768000
gpuID = 0

device!(CuDevice(gpuID))
println("Running on "*string(CuDevice(gpuID)))

function lorenz(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return @SVector [du1, du2, du3]
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem(lorenz, u0, tspan, p)

parameterList = range(0.0f0,stop = 21.0f0,length=numberOfParameters)

lorenzProblem =  ODEProblem(lorenz,u0,tspan,p)
#parameterChange = (prob,i,repeat) -> remake(prob,p=parameterList[i]) #it has slightly more allocations
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector [parameterList[i], p[2] ,p[3]]))

ensembleProb = EnsembleProblem(lorenzProblem,prob_func = prob_func)

## Building problems here only
I = 1:numberOfParameters
if ensembleProb.safetycopy
    probs = map(I) do i
        ensembleProb.prob_func(deepcopy(ensembleProb.prob), i, 1)
    end
else
    probs = map(I) do i
        ensembleProb.prob_func(ensembleProb.prob, i, 1)
    end
end

## Make them compatible with CUDA
probs = cu(probs)

@info "Solving the problem"
data = @benchmark @CUDA.sync DiffEqGPU.vectorized_solve($probs, $ensembleProb.prob, GPUSimpleTsit5();
save_everystep = false, dt = 0.01f0)


println("Parameter number: "*string(numberOfParameters))
println("Minimum time: "*string(minimum(data.times)/1e6)*" ms")
println("Allocs: "*string(data.allocs))

To be fair, vectorized_solve does really represent the exact solve time because benchmarking should not include conversion other extraneous overheads. DiffEqGPU provides ease for the users to provide generic API compatible with other stuff from DifferentialEquations.jl.

Possible Solution:

Write a completely different API, which will require to build the users to build problems on their own and pass those array of problems to GPUTsit5.

Answer 1 · 2022-10-26T17:01:40.000Z

The script used for MPGOS: https://github.com/nnagyd/ode_solver_tests/blob/master/Lorenz_RK4/GPU_MPGOS/Lorenz.cu

Answer 2 · 2022-10-26T17:02:49.000Z

I wonder if the architecture specified in their makefile is why we doing see better MPGOS results on the newer hardware. https://github.com/nnagyd/ode_solver_tests/blob/master/Lorenz_RK4/GPU_MPGOS/makefile

Answer 3 · 2022-10-26T17:04:23.000Z

It looks like your TESLA v100 is sm_70.

Answer 4 · 2022-10-26T19:52:53.000Z

I am testing on a clean OS with an A100 40 GB GPU and

(@v1.8) pkg> st
Status `~/.julia/environments/v1.8/Project.toml`
  [6e4b80f9] BenchmarkTools v1.3.1
  [a9c8d775] CPUTime v1.0.0
  [052768ef] CUDA v3.12.0
  [071ae1c0] DiffEqGPU v1.20.0
  [1dea7af3] OrdinaryDiffEq v6.29.3
  [05bca326] SimpleDiffEq v1.7.0
  [90137ffa] StaticArrays v1.5.9

and am getting the following during the DiffEqGPU.vectorized_solve

ERROR: InvalidIRError: compiling kernel #tsit5_kernel(CuDeviceVector{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, CuDeviceMatrix{SVector{3, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Int64, Nothing, Val{false}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to apply_discrete_callback!)
Stacktrace:
 [1] step!
   @ ~/.julia/packages/DiffEqGPU/CiiCq/src/perform_step/gpu_tsit5_perform_step.jl:60
 [2] tsit5_kernel
   @ ~/.julia/packages/DiffEqGPU/CiiCq/src/perform_step/gpu_tsit5_perform_step.jl:105
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(DiffEqGPU.tsit5_kernel), Tuple{CuDeviceVector{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, CuDeviceMatrix{SVector{3, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Int64, Nothing, Val{false}}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/6qAH4/src/validation.jl:141
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/6qAH4/src/driver.jl:418 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/4yHI4/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/6qAH4/src/driver.jl:416 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/6qAH4/src/utils.jl:68
  [6] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:354
  [7] #224
    @ ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:347 [inlined]
  [8] JuliaContext(f::CUDA.var"#224#225"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(DiffEqGPU.tsit5_kernel), Tuple{CuDeviceVector{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, CuDeviceMatrix{SVector{3, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Int64, Nothing, Val{false}}}}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/6qAH4/src/driver.jl:76
  [9] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:346
 [10] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/6qAH4/src/cache.jl:90
 [11] cufunction(f::typeof(DiffEqGPU.tsit5_kernel), tt::Type{Tuple{CuDeviceVector{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, CuDeviceMatrix{SVector{3, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Int64, Nothing, Val{false}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:299
 [12] cufunction(f::typeof(DiffEqGPU.tsit5_kernel), tt::Type{Tuple{CuDeviceVector{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, CuDeviceMatrix{SVector{3, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Int64, Nothing, Val{false}}})
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:292
 [13] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:102 [inlined]
 [14] vectorized_solve(probs::CuArray{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1, CUDA.Mem.DeviceBuffer}, prob::ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, alg::GPUSimpleTsit5; dt::Float32, saveat::Nothing, save_everystep::Bool, debug::Bool, callback::CallbackSet{Tuple{}, Tuple{}}, tstops::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/CiiCq/src/solve.jl:34
 [15] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/utilities.jl:25 [inlined]
 [16] var"##core#315"(probs#313::CuArray{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1, CUDA.Mem.DeviceBuffer}, ensembleProb#314::EnsembleProblem{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#1#2", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing})
    @ Main ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:489
 [17] var"##sample#316"(::Tuple{CuArray{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1, CUDA.Mem.DeviceBuffer}, EnsembleProblem{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#1#2", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}}, __params::BenchmarkTools.Parameters)
    @ Main ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:495
 [18] _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Pairs{Symbol, Integer, NTuple{4, Symbol}, NamedTuple{(:samples, :evals, :gctrial, :gcsample), Tuple{Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:99
 [19] #invokelatest#2
    @ ./essentials.jl:731 [inlined]
 [20] #run_result#45
    @ ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:34 [inlined]
 [21] run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Pairs{Symbol, Integer, NTuple{5, Symbol}, NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample), Tuple{Bool, Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:117
 [22] #warmup#54
    @ ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:169 [inlined]
 [23] warmup(item::BenchmarkTools.Benchmark)
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:168
 [24] top-level scope
    @ ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:393
 [25] top-level scope
    @ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:52

Answer 5 · 2022-10-26T19:59:28.000Z

removing callback = CallbackSet() works though

Answer 6 · 2022-10-26T20:00:33.000Z

Yes, please use that. I was checked out on development and hence made that change.

Answer 7 · 2022-10-26T20:01:39.000Z

For A100 40GB:

Julia script above

Parameter number: 768000
Minimum time: 9.799917 ms
Allocs: 97

Answer 8 · 2022-10-26T20:04:23.000Z

Okay yeah, so that's expected. A100's better than Tesla V100 and hence faster 14.6ms.
Let me generate scripts for plotting + testing against MPGOS, and I'll share them with you.

Answer 9 · 2022-10-26T20:09:59.000Z

MPGOS w/ default makefile

Total simulation time:           14.429ms
Simulation time / 1000 RK4 step: 14.429ms
Ensemble size:                   768000

removing the architecture and maxregisters from the MPGOS makefile has virtual no impact on the results

Answer 10 · 2022-10-26T20:12:33.000Z

I got similar results too and tried with sm_70 architecture as well. Can you run for 768 Ensemble size to check if we are getting that huge perf increase as shown in the plots?

Answer 11 · 2022-10-26T20:14:11.000Z

Where do I set the ensemble for MPGOS?

Answer 12 · 2022-10-26T20:15:49.000Z

I think it is this: https://github.com/nnagyd/ode_solver_tests/blob/master/Lorenz_RK4/GPU_MPGOS/Lorenz.cu#L17
And do:

make clean
make

Answer 13 · 2022-10-26T20:17:54.000Z

Julia

Parameter number: 768
Minimum time: 0.196175 ms
Allocs: 46

MPGOS

Total simulation time:           3.115ms
Ensemble size:                   768

Answer 14 · 2022-10-26T20:24:46.000Z

@ChrisRackauckas

Answer 15 · 2022-10-26T20:36:07.000Z

@utkarsh530 Are you sure we are simulating the same things as MPGOS? https://github.com/nnagyd/ode_solver_tests/blob/master/Lorenz_RK4/CPU_Julia/CPU_Lorenz_julia_ensemble.jl has a 384 dim ODE.

Answer 16 · 2022-10-26T21:00:54.000Z

Ignore the link above. It seems like they have code for single and coupled systems. However, the MPGOS single system still doesn't match what you are simulating.

They are varying ρ not σ. Its also not clear from the code what their IC and time range is.

Below more explicitly matches the MPGOS ODE.

function lorenz(u, p, t)
    du1 = 10.0f0 * (u[2] - u[1])
    du2 = p*u[1] - u[2] - u[1]*u[3]
    du3 = u[1]*u[2] - 2.666f0 * u[3]
    return @SVector [du1, du2, du3]
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = 28f0
prob = ODEProblem(lorenz, u0, tspan, p)

parameterList = range(0.0f0,stop = 21.0f0,length=numberOfParameters)

lorenzProblem =  ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob, i, repeat) -> remake(prob, p = parameterList[i])

I also added tols to match as well

data = @benchmark @CUDA.sync DiffEqGPU.vectorized_solve($probs, $ensembleProb.prob, GPUSimpleTsit5();
save_everystep = false, dt = 0.01f0, abstol=1f-8, reltol=1f-8)

It doesn't really change the timing luckily

Parameter number: 768000
Minimum time: 9.704811 ms

Answer 17 · 2022-10-26T21:05:18.000Z

I am not sure about that actually, but what they are trying to do is change unroll which sets the number of Lorenz problems to solve on a single thread. See here in the Ensemble case, (where they are setting trajectories as numberOfParameters/unroll): https://github.com/nnagyd/ode_solver_tests/blob/master/Lorenz_RK4/CPU_Julia/CPU_Lorenz_julia_ensemble.jl#L60
Hence, what we are simulating is unroll = 1, at least from what I can say in Julia.
However, I still need to understand more about what they are doing.

Answer 18 · 2022-10-26T21:07:23.000Z

Yeah. thanks for that. The tolerance doesn't make sense in fix time-stepping, right? Use vectorized_asolve for the adaptive version.

Answer 19 · 2022-10-26T21:12:54.000Z

I was not sure why they say Simulation time / 1000 RK4 step. It confused me to use adaptive = false case.

Answer 20 · 2022-10-26T21:15:21.000Z

Why do you say fixed time? The stats spit out by MPGOS show the min/max time steps as

   Maximum time step:                         1e+06
   Minimum time step:                         1e-12

Answer 21 · 2022-10-26T21:16:41.000Z

Because when they were using EnsembleGPUArray from Julia DiffEqGPU they were using adaptive = false https://github.com/nnagyd/ode_solver_tests/blob/master/Lorenz_RK4/GPU_Julia/GPU_Lorenz_simple_julia.jl#L77

Answer 22 · 2022-10-26T21:18:00.000Z

Its hard to tell what they were doing with Julia. However since we are benchmarking against https://github.com/nnagyd/ode_solver_tests/tree/master/Lorenz_RK4/GPU_MPGOS we should try to match that.

Answer 23 · 2022-10-26T21:19:07.000Z

changing to vectorized_asolve gives

data = @benchmark @CUDA.sync DiffEqGPU.vectorized_asolve($probs, $ensembleProb.prob, GPUSimpleTsit5();
       save_everystep = false, dt = 0.01f0, abstol=1f-8, reltol=1f-8)
ERROR: MethodError: no method matching vectorized_asolve(::CuArray{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, Float32, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1, CUDA.Mem.DeviceBuffer}, ::ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, Float32, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, ::GPUSimpleTsit5; save_everystep=false, dt=0.01f0, abstol=1.0f-8, reltol=1.0f-8)
Closest candidates are:
  vectorized_asolve(::Any, ::ODEProblem, ::GPUSimpleATsit5; dt, saveat, save_everystep, abstol, reltol, debug, callback, tstops, kwargs...) at ~/.julia/packages/DiffEqGPU/CiiCq/src/solve.jl:57
Stacktrace:
  [1] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/src/utilities.jl:25 [inlined]
  [2] var"##core#397"(probs#395::CuArray{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, Float32, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1, CUDA.Mem.DeviceBuffer}, ensembleProb#396::EnsembleProblem{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, Float32, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#25#26", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing})
    @ Main ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:489
  [3] var"##sample#398"(::Tuple{CuArray{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, Float32, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1, CUDA.Mem.DeviceBuffer}, EnsembleProblem{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, Float32, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#25#26", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}}, __params::BenchmarkTools.Parameters)
    @ Main ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:495
  [4] _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Pairs{Symbol, Integer, NTuple{4, Symbol}, NamedTuple{(:samples, :evals, :gctrial, :gcsample), Tuple{Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:99
  [5] #invokelatest#2
    @ ./essentials.jl:731 [inlined]
  [6] #run_result#45
    @ ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:34 [inlined]
  [7] run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Pairs{Symbol, Integer, NTuple{5, Symbol}, NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample), Tuple{Bool, Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:117
  [8] #warmup#54
    @ ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:169 [inlined]
  [9] warmup(item::BenchmarkTools.Benchmark)
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:168
 [10] top-level scope
    @ ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:393

Answer 24 · 2022-10-26T21:22:30.000Z

ah... needed GPUSimpleATsit5

julia> data = @benchmark @CUDA.sync DiffEqGPU.vectorized_solve($probs, $ensembleProb.prob, GPUSimpleTsit5();
           save_everystep = false, dt = 0.01f0, abstol=1f-8, reltol=1f-8)
BenchmarkTools.Trial: 507 samples with 1 evaluation.
 Range (min … max):  9.786 ms …  23.002 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.808 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.858 ms ± 649.728 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▅▃█▇▄▃▁▄▃ ▂▁                                               
  ▅▇████████████▇▆▆▅▅▆▂▄▄▃▆▅▅▆▃▃▄▃▃▃▁▄▃▄▁▂▂▁▂▃▁▃▁▁▂▁▁▁▁▂▁▁▂▁▂ ▄
  9.79 ms         Histogram: frequency by time        9.92 ms <

 Memory estimate: 4.94 KiB, allocs estimate: 97.

julia> data = @benchmark @CUDA.sync DiffEqGPU.vectorized_asolve($probs, $ensembleProb.prob, GPUSimpleATsit5();
           save_everystep = false, dt = 0.01f0, abstol=1f-8, reltol=1f-8)
BenchmarkTools.Trial: 186 samples with 1 evaluation.
 Range (min … max):  26.867 ms … 27.191 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     26.899 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.908 ms ± 44.145 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▄▄▆▅█▁▁▂                                                   
  ▃▃████████▇▅▅▅▄▃▃▃▃▃▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃ ▃
  26.9 ms         Histogram: frequency by time        27.2 ms <

 Memory estimate: 4.69 KiB, allocs estimate: 89.

Answer 25 · 2022-10-26T21:43:35.000Z

Thanks for the results.
Okay, I think I got it. They are using two algorithms RK4 and RKCK45. The RK4 is the adaptive = false version, see here: https://github.com/nnagyd/ode_solver_tests/blob/master/Lorenz_RK4/GPU_MPGOS/SourceCodes/SingleSystem_PerThread_ExplicitRungeKutta_ErrorControllers.cuh#L6-L20. And they define to use RK4 here: https://github.com/nnagyd/ode_solver_tests/blob/master/Lorenz_RK4/GPU_MPGOS/Lorenz.cu#L15. Hence, I am pretty sure they are using adaptive = false case.
They printing tolerance with RK4 which is still very weird.

Answer 26 · 2022-10-26T22:05:09.000Z

I feel there might be something different with our adaptive time-stepping as RKCK45 is faster than RK4.

Answer 27 · 2022-10-26T22:22:42.000Z

I made some changes within the script: MPGOS is integrating till 1.0s. I am quite sure about adaptivity that RK4 is fixed-time step and RKCK45 is adaptive time-stepping. In any case, we are doing better now:

using DiffEqGPU, BenchmarkTools, StaticArrays, SimpleDiffEq
using CUDA

@show ARGS
#settings

numberOfParameters = isinteractive() ? 768000 : parse(Int64, ARGS[1])
gpuID = 0

device!(CuDevice(gpuID))
println("Running on "*string(CuDevice(gpuID)))

function lorenz(u, p, t)
    du1 = 10.0f0 * (u[2] - u[1])
    du2 = p*u[1] - u[2] - u[1]*u[3]
    du3 = u[1]*u[2] - 2.666f0 * u[3]
    return @SVector [du1, du2, du3]
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 1.0f0)
p = 28f0
prob = ODEProblem(lorenz, u0, tspan, p)

parameterList = range(0.0f0,stop = 21.0f0,length=numberOfParameters)

lorenzProblem =  ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob, i, repeat) -> remake(prob, p = parameterList[i])

ensembleProb = EnsembleProblem(lorenzProblem,prob_func = prob_func)

## Building problems here only
I = 1:numberOfParameters
if ensembleProb.safetycopy
    probs = map(I) do i
        ensembleProb.prob_func(deepcopy(ensembleProb.prob), i, 1)
    end
else
    probs = map(I) do i
        ensembleProb.prob_func(ensembleProb.prob, i, 1)
    end
end

## Make them compatible with CUDA
probs = cu(probs)

@info "Solving the problem"
data = @benchmark @CUDA.sync DiffEqGPU.vectorized_solve($probs, $ensembleProb.prob, GPUSimpleTsit5();
save_everystep = false, dt = 0.001f0)



println("Parameter number: "*string(numberOfParameters))
println("Minimum time: "*string(minimum(data.times)/1e6)*" ms")
println("Allocs: "*string(data.allocs))


data = @benchmark @CUDA.sync DiffEqGPU.vectorized_asolve($probs, $ensembleProb.prob, GPUSimpleATsit5();
dt = 0.001f0,reltol = 1f-8,abstol=1f-8)


println("Parameter number: "*string(numberOfParameters))
println("Minimum time: "*string(minimum(data.times)/1e6)*" ms")
println("Allocs: "*string(data.allocs))

julia> data = @benchmark @CUDA.sync DiffEqGPU.vectorized_solve($probs, $ensembleProb.prob, GPUSimpleTsit5();
       save_everystep = false, dt = 0.001f0)
BenchmarkTools.Trial: 295 samples with 1 evaluation.
 Range (min … max):  14.925 ms … 36.507 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     15.416 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.955 ms ±  4.546 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█▅
  ████▇▅▅█▄▄▄▁▄▁▁▁▄▁▄▄▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▄▁▄▅▆▅▅▆▄▁▄▁▁▄▁▄▁▄▅▁▁▄ ▅
  14.9 ms      Histogram: log(frequency) by time      35.2 ms <

 Memory estimate: 5.16 KiB, allocs estimate: 107.

julia> println("Parameter number: "*string(numberOfParameters))
Parameter number: 768000

julia> println("Minimum time: "*string(minimum(data.times)/1e6)*" ms")
Minimum time: 14.925271 ms

julia> println("Allocs: "*string(data.allocs))
Allocs: 107

julia> data = @benchmark @CUDA.sync DiffEqGPU.vectorized_asolve($probs, $ensembleProb.prob, GPUSimpleATsit5();
       dt = 0.001f0,reltol = 1f-8,abstol=1f-8)
BenchmarkTools.Trial: 560 samples with 1 evaluation.
 Range (min … max):  7.059 ms … 42.164 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.723 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.921 ms ±  4.485 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▆▂
  ████▇▅▇▅▅▁▄▄▄▄▅▁▅▄▁▄▄▁▁▁▁▅▁▁▁▁▁▁▁▆▆▁▅▁▁▄▄▁▁▁▁▄▄▁▄▄▄▄▁▅▁▄▄▄ ▆
  7.06 ms      Histogram: log(frequency) by time     29.1 ms <

 Memory estimate: 4.80 KiB, allocs estimate: 97.

Answer 28 · 2022-10-27T08:48:46.000Z

RKCK45 is a Cash Karp method most comparable to Tsit5, not RK4

Answer 29 · 2022-12-13T13:05:31.000Z

is minimum time the right measurement here? IMO throughout matters a lot more when you're on a gpu

Answer 30 · 2023-04-21T09:14:11.000Z

Superceded by https://github.com/utkarsh530/GPUODEBenchmarks