Recent benchmarks of Kernelized GPU ODE solvers with MPGOS
utkarsh530 opened this issue · 30 comments
Comparison of GPUTsit5 with MPGOS:
I used the script from this Discourse thread (by the author who compared MPGOS and the Julia stack earlier): https://discourse.julialang.org/t/performance-problems-with-parallel-ensemble-simulation-on-gpu/34596 (taking the minimum of the simulation times), and it seems that we are currently faster than MPGOS, in terms of only the raw solve.
Threads = 768000
Batch size = 768000
GPUTsit5:
println("Minimum time: "*string(minimum(data.times)/1e6)*" ms")
Minimum time: 14.594967 ms
MPGOS:
Total simulation time: 16.749ms
Simulation time / 1000 RK4 step: 16.749ms
Ensemble size: 768000
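As an aside on reading the MPGOS output: the total time and the per-1000-RK4-step time coincide because the run appears to be exactly 1000 fixed steps. A quick check with the tspan and dt from the script below (an inference from those settings, not something MPGOS states):
tspan = (0.0f0, 10.0f0)
dt = 0.01f0
nsteps = round(Int, (tspan[2] - tspan[1]) / dt)  # 1000 RK4 steps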
Here is the log-log plot of simulation times against the number of trajectories.
Caveat:
Some of the issues with using the DiffEqGPU API with GPUTsit5 are listed here: #171. So, to extract maximum performance from these solvers, I custom-generated the ensemble problems as shown in the script below and called the lower-level API directly.
using OrdinaryDiffEq, DiffEqGPU, BenchmarkTools, StaticArrays, CPUTime, SimpleDiffEq
using CUDA
@show ARGS
#settings
numberOfParameters = 768000
gpuID = 0
device!(CuDevice(gpuID))
println("Running on "*string(CuDevice(gpuID)))
function lorenz(u, p, t)
σ = p[1]
ρ = p[2]
β = p[3]
du1 = σ * (u[2] - u[1])
du2 = u[1] * (ρ - u[3]) - u[2]
du3 = u[1] * u[2] - β * u[3]
return @SVector [du1, du2, du3]
end
u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem(lorenz, u0, tspan, p)
parameterList = range(0.0f0,stop = 21.0f0,length=numberOfParameters)
lorenzProblem = ODEProblem(lorenz,u0,tspan,p)
#parameterChange = (prob,i,repeat) -> remake(prob,p=parameterList[i]) #it has slightly more allocations
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector [parameterList[i], p[2] ,p[3]]))
ensembleProb = EnsembleProblem(lorenzProblem,prob_func = prob_func)
## Building problems here only
I = 1:numberOfParameters
if ensembleProb.safetycopy
probs = map(I) do i
ensembleProb.prob_func(deepcopy(ensembleProb.prob), i, 1)
end
else
probs = map(I) do i
ensembleProb.prob_func(ensembleProb.prob, i, 1)
end
end
## Make them compatible with CUDA
probs = cu(probs)
@info "Solving the problem"
data = @benchmark @CUDA.sync DiffEqGPU.vectorized_solve($probs, $ensembleProb.prob, GPUSimpleTsit5();
save_everystep = false, dt = 0.01f0)
println("Parameter number: "*string(numberOfParameters))
println("Minimum time: "*string(minimum(data.times)/1e6)*" ms")
println("Allocs: "*string(data.allocs))
To be fair, vectorized_solve really does represent the exact solve time, because the benchmark should not include conversion or other extraneous overheads. DiffEqGPU aims to give users a generic API compatible with the rest of DifferentialEquations.jl.
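For context, the high-level path being referred to looks roughly like this (a sketch based on the DiffEqGPU README of this era, not part of the benchmark; names follow the script above):
using OrdinaryDiffEq, DiffEqGPU
# The same EnsembleProblem through the common solve interface. This path
# carries the conversion and setup overhead that the raw vectorized_solve
# benchmark above deliberately excludes.
sol = solve(ensembleProb, Tsit5(), EnsembleGPUArray();
            trajectories = numberOfParameters, save_everystep = false)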
Possible Solution:
Write a completely different API, which will require users to build the problems on their own and pass that array of problems to GPUTsit5.
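A minimal sketch of what such an entry point could look like (solve_batched is a hypothetical name, not an existing DiffEqGPU function; it just wraps the two steps the script above does by hand):
using CUDA, DiffEqGPU, SciMLBase
# Hypothetical wrapper: the user owns problem construction; we only move the
# array of problems to the GPU and launch the kernelized solve.
function solve_batched(probs::Vector{<:SciMLBase.ODEProblem}, alg; dt, kwargs...)
    gpu_probs = cu(probs)  # device-side array of problems
    return DiffEqGPU.vectorized_solve(gpu_probs, probs[1], alg; dt, kwargs...)
end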
The script used for MPGOS: https://github.com/nnagyd/ode_solver_tests/blob/master/Lorenz_RK4/GPU_MPGOS/Lorenz.cu
I wonder if the architecture specified in their makefile is why we aren't seeing better MPGOS results on the newer hardware: https://github.com/nnagyd/ode_solver_tests/blob/master/Lorenz_RK4/GPU_MPGOS/makefile
It looks like your Tesla V100 is sm_70.
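For cross-checking, the compute capability can also be queried from the Julia side (a CUDA.jl call; sm_70 corresponds to v"7.0"):
using CUDA
# Query the active device's compute capability: a Tesla V100 reports v"7.0"
# (sm_70), an A100 reports v"8.0" (sm_80).
CUDA.capability(CUDA.device())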
I am testing on a clean OS with an A100 40 GB GPU and
(@v1.8) pkg> st
Status `~/.julia/environments/v1.8/Project.toml`
[6e4b80f9] BenchmarkTools v1.3.1
[a9c8d775] CPUTime v1.0.0
[052768ef] CUDA v3.12.0
[071ae1c0] DiffEqGPU v1.20.0
[1dea7af3] OrdinaryDiffEq v6.29.3
[05bca326] SimpleDiffEq v1.7.0
[90137ffa] StaticArrays v1.5.9
and am getting the following during DiffEqGPU.vectorized_solve:
ERROR: InvalidIRError: compiling kernel #tsit5_kernel(CuDeviceVector{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, CuDeviceMatrix{SVector{3, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Int64, Nothing, Val{false}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to apply_discrete_callback!)
Stacktrace:
[1] step!
@ ~/.julia/packages/DiffEqGPU/CiiCq/src/perform_step/gpu_tsit5_perform_step.jl:60
[2] tsit5_kernel
@ ~/.julia/packages/DiffEqGPU/CiiCq/src/perform_step/gpu_tsit5_perform_step.jl:105
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
[1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(DiffEqGPU.tsit5_kernel), Tuple{CuDeviceVector{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, CuDeviceMatrix{SVector{3, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Int64, Nothing, Val{false}}}}, args::LLVM.Module)
@ GPUCompiler ~/.julia/packages/GPUCompiler/6qAH4/src/validation.jl:141
[2] macro expansion
@ ~/.julia/packages/GPUCompiler/6qAH4/src/driver.jl:418 [inlined]
[3] macro expansion
@ ~/.julia/packages/TimerOutputs/4yHI4/src/TimerOutput.jl:253 [inlined]
[4] macro expansion
@ ~/.julia/packages/GPUCompiler/6qAH4/src/driver.jl:416 [inlined]
[5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
@ GPUCompiler ~/.julia/packages/GPUCompiler/6qAH4/src/utils.jl:68
[6] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
@ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:354
[7] #224
@ ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:347 [inlined]
[8] JuliaContext(f::CUDA.var"#224#225"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(DiffEqGPU.tsit5_kernel), Tuple{CuDeviceVector{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, CuDeviceMatrix{SVector{3, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Int64, Nothing, Val{false}}}}})
@ GPUCompiler ~/.julia/packages/GPUCompiler/6qAH4/src/driver.jl:76
[9] cufunction_compile(job::GPUCompiler.CompilerJob)
@ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:346
[10] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
@ GPUCompiler ~/.julia/packages/GPUCompiler/6qAH4/src/cache.jl:90
[11] cufunction(f::typeof(DiffEqGPU.tsit5_kernel), tt::Type{Tuple{CuDeviceVector{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, CuDeviceMatrix{SVector{3, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Int64, Nothing, Val{false}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:299
[12] cufunction(f::typeof(DiffEqGPU.tsit5_kernel), tt::Type{Tuple{CuDeviceVector{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, CuDeviceMatrix{SVector{3, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Int64, Nothing, Val{false}}})
@ CUDA ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:292
[13] macro expansion
@ ~/.julia/packages/CUDA/DfvRa/src/compiler/execution.jl:102 [inlined]
[14] vectorized_solve(probs::CuArray{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1, CUDA.Mem.DeviceBuffer}, prob::ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, alg::GPUSimpleTsit5; dt::Float32, saveat::Nothing, save_everystep::Bool, debug::Bool, callback::CallbackSet{Tuple{}, Tuple{}}, tstops::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ DiffEqGPU ~/.julia/packages/DiffEqGPU/CiiCq/src/solve.jl:34
[15] macro expansion
@ ~/.julia/packages/CUDA/DfvRa/src/utilities.jl:25 [inlined]
[16] var"##core#315"(probs#313::CuArray{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1, CUDA.Mem.DeviceBuffer}, ensembleProb#314::EnsembleProblem{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#1#2", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing})
@ Main ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:489
[17] var"##sample#316"(::Tuple{CuArray{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1, CUDA.Mem.DeviceBuffer}, EnsembleProblem{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, SVector{3, Float32}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#1#2", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}}, __params::BenchmarkTools.Parameters)
@ Main ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:495
[18] _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Pairs{Symbol, Integer, NTuple{4, Symbol}, NamedTuple{(:samples, :evals, :gctrial, :gcsample), Tuple{Int64, Int64, Bool, Bool}}})
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:99
[19] #invokelatest#2
@ ./essentials.jl:731 [inlined]
[20] #run_result#45
@ ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:34 [inlined]
[21] run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Pairs{Symbol, Integer, NTuple{5, Symbol}, NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample), Tuple{Bool, Int64, Int64, Bool, Bool}}})
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:117
[22] #warmup#54
@ ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:169 [inlined]
[23] warmup(item::BenchmarkTools.Benchmark)
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:168
[24] top-level scope
@ ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:393
[25] top-level scope
@ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:52
Removing callback = CallbackSet() works, though.
Yes, please use that. I was checked out on development and hence made that change.
For A100 40GB:
Julia script above
Parameter number: 768000
Minimum time: 9.799917 ms
Allocs: 97
Okay yeah, so that's expected. The A100 is better than the Tesla V100 and hence faster than the 14.6 ms.
Let me generate scripts for plotting + testing against MPGOS, and I'll share them with you.
MPGOS w/ default makefile
Total simulation time: 14.429ms
Simulation time / 1000 RK4 step: 14.429ms
Ensemble size: 768000
Removing the architecture and maxregisters flags from the MPGOS makefile has virtually no impact on the results.
I got similar results too, and tried with the sm_70 architecture as well. Can you run with an ensemble size of 768 to check if we are getting the huge perf increase shown in the plots?
Where do I set the ensemble size for MPGOS?
I think it is this: https://github.com/nnagyd/ode_solver_tests/blob/master/Lorenz_RK4/GPU_MPGOS/Lorenz.cu#L17
And do:
make clean
make
Julia
Parameter number: 768
Minimum time: 0.196175 ms
Allocs: 46
MPGOS
Total simulation time: 3.115ms
Ensemble size: 768
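For the record, the implied speedup at this small ensemble size is straight arithmetic on the minima reported above:
3.115 / 0.196175  # MPGOS ms / GPUTsit5 ms, ≈ 15.9x in favor of GPUTsit5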
@utkarsh530 Are you sure we are simulating the same things as MPGOS? https://github.com/nnagyd/ode_solver_tests/blob/master/Lorenz_RK4/CPU_Julia/CPU_Lorenz_julia_ensemble.jl has a 384-dimensional ODE.
Ignore the link above. It seems like they have code for both single and coupled systems. However, the MPGOS single system still doesn't match what you are simulating.
They are varying ρ, not σ. It's also not clear from the code what their initial condition and time range are.
The code below more explicitly matches the MPGOS ODE.
function lorenz(u, p, t)
du1 = 10.0f0 * (u[2] - u[1])
du2 = p*u[1] - u[2] - u[1]*u[3]
du3 = u[1]*u[2] - 2.666f0 * u[3]
return @SVector [du1, du2, du3]
end
u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = 28f0
prob = ODEProblem(lorenz, u0, tspan, p)
parameterList = range(0.0f0,stop = 21.0f0,length=numberOfParameters)
lorenzProblem = ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob, i, repeat) -> remake(prob, p = parameterList[i])
I added tolerances to match as well:
data = @benchmark @CUDA.sync DiffEqGPU.vectorized_solve($probs, $ensembleProb.prob, GPUSimpleTsit5();
save_everystep = false, dt = 0.01f0, abstol=1f-8, reltol=1f-8)
Luckily, it doesn't really change the timing:
Parameter number: 768000
Minimum time: 9.704811 ms
I am not sure about that, actually, but what they are trying to do is change unroll, which sets the number of Lorenz problems to solve on a single thread. See here in the ensemble case (where they set trajectories as numberOfParameters/unroll): https://github.com/nnagyd/ode_solver_tests/blob/master/Lorenz_RK4/CPU_Julia/CPU_Lorenz_julia_ensemble.jl#L60
Hence, what we are simulating is unroll = 1, at least from what I can tell in Julia.
However, I still need to understand more about what they are doing.
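To make the unroll bookkeeping concrete, here is a small sketch of the trajectory count it implies (my interpretation of the linked script, not their exact code):
# unroll packs several independent Lorenz copies into one thread, so the
# number of launched trajectories shrinks accordingly.
numberOfParameters = 768000
unroll = 4                                   # example value
trajectories = numberOfParameters ÷ unroll   # what gets passed as the ensemble size
# unroll = 1 recovers one Lorenz system per trajectory, which is what the
# Julia benchmark above simulates.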
Yeah, thanks for that. The tolerance doesn't make sense in fixed time-stepping, right? Use vectorized_asolve for the adaptive version.
I was not sure why they say Simulation time / 1000 RK4 step. It confused me into using the adaptive = false case.
Why do you say fixed time? The stats spit out by MPGOS show the min/max time steps as
Maximum time step: 1e+06
Minimum time step: 1e-12
Because when they were using EnsembleGPUArray from Julia's DiffEqGPU, they were using adaptive = false: https://github.com/nnagyd/ode_solver_tests/blob/master/Lorenz_RK4/GPU_Julia/GPU_Lorenz_simple_julia.jl#L77
It's hard to tell what they were doing with Julia. However, since we are benchmarking against https://github.com/nnagyd/ode_solver_tests/tree/master/Lorenz_RK4/GPU_MPGOS, we should try to match that.
Changing to vectorized_asolve gives
data = @benchmark @CUDA.sync DiffEqGPU.vectorized_asolve($probs, $ensembleProb.prob, GPUSimpleTsit5();
save_everystep = false, dt = 0.01f0, abstol=1f-8, reltol=1f-8)
ERROR: MethodError: no method matching vectorized_asolve(::CuArray{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, Float32, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1, CUDA.Mem.DeviceBuffer}, ::ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, Float32, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, ::GPUSimpleTsit5; save_everystep=false, dt=0.01f0, abstol=1.0f-8, reltol=1.0f-8)
Closest candidates are:
vectorized_asolve(::Any, ::ODEProblem, ::GPUSimpleATsit5; dt, saveat, save_everystep, abstol, reltol, debug, callback, tstops, kwargs...) at ~/.julia/packages/DiffEqGPU/CiiCq/src/solve.jl:57
Stacktrace:
[1] macro expansion
@ ~/.julia/packages/CUDA/DfvRa/src/utilities.jl:25 [inlined]
[2] var"##core#397"(probs#395::CuArray{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, Float32, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1, CUDA.Mem.DeviceBuffer}, ensembleProb#396::EnsembleProblem{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, Float32, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#25#26", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing})
@ Main ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:489
[3] var"##sample#398"(::Tuple{CuArray{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, Float32, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1, CUDA.Mem.DeviceBuffer}, EnsembleProblem{ODEProblem{SVector{3, Float32}, Tuple{Float32, Float32}, false, Float32, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#25#26", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}}, __params::BenchmarkTools.Parameters)
@ Main ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:495
[4] _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Pairs{Symbol, Integer, NTuple{4, Symbol}, NamedTuple{(:samples, :evals, :gctrial, :gcsample), Tuple{Int64, Int64, Bool, Bool}}})
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:99
[5] #invokelatest#2
@ ./essentials.jl:731 [inlined]
[6] #run_result#45
@ ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:34 [inlined]
[7] run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Pairs{Symbol, Integer, NTuple{5, Symbol}, NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample), Tuple{Bool, Int64, Int64, Bool, Bool}}})
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:117
[8] #warmup#54
@ ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:169 [inlined]
[9] warmup(item::BenchmarkTools.Benchmark)
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:168
[10] top-level scope
@ ~/.julia/packages/BenchmarkTools/7xSXH/src/execution.jl:393
ah... needed GPUSimpleATsit5
julia> data = @benchmark @CUDA.sync DiffEqGPU.vectorized_solve($probs, $ensembleProb.prob, GPUSimpleTsit5();
save_everystep = false, dt = 0.01f0, abstol=1f-8, reltol=1f-8)
BenchmarkTools.Trial: 507 samples with 1 evaluation.
Range (min … max): 9.786 ms … 23.002 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 9.808 ms ┊ GC (median): 0.00%
Time (mean ± σ): 9.858 ms ± 649.728 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅▃█▇▄▃▁▄▃ ▂▁
▅▇████████████▇▆▆▅▅▆▂▄▄▃▆▅▅▆▃▃▄▃▃▃▁▄▃▄▁▂▂▁▂▃▁▃▁▁▂▁▁▁▁▂▁▁▂▁▂ ▄
9.79 ms Histogram: frequency by time 9.92 ms <
Memory estimate: 4.94 KiB, allocs estimate: 97.
julia> data = @benchmark @CUDA.sync DiffEqGPU.vectorized_asolve($probs, $ensembleProb.prob, GPUSimpleATsit5();
save_everystep = false, dt = 0.01f0, abstol=1f-8, reltol=1f-8)
BenchmarkTools.Trial: 186 samples with 1 evaluation.
Range (min … max): 26.867 ms … 27.191 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 26.899 ms ┊ GC (median): 0.00%
Time (mean ± σ): 26.908 ms ± 44.145 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▄▆▅█▁▁▂
▃▃████████▇▅▅▅▄▃▃▃▃▃▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃ ▃
26.9 ms Histogram: frequency by time 27.2 ms <
Memory estimate: 4.69 KiB, allocs estimate: 89.
Thanks for the results.
Okay, I think I got it. They are using two algorithms, RK4 and RKCK45. RK4 is the adaptive = false version, see here: https://github.com/nnagyd/ode_solver_tests/blob/master/Lorenz_RK4/GPU_MPGOS/SourceCodes/SingleSystem_PerThread_ExplicitRungeKutta_ErrorControllers.cuh#L6-L20. And they select RK4 here: https://github.com/nnagyd/ode_solver_tests/blob/master/Lorenz_RK4/GPU_MPGOS/Lorenz.cu#L15. Hence, I am pretty sure they are using the adaptive = false case.
They are printing tolerances with RK4, which is still very weird.
I feel there might be something different with our adaptive time-stepping, as RKCK45 is faster than RK4.
I made some changes within the script: MPGOS is integrating until 1.0 s. I am quite sure about the adaptivity: RK4 is fixed time-stepping and RKCK45 is adaptive time-stepping. In any case, we are doing better now:
using DiffEqGPU, BenchmarkTools, StaticArrays, SimpleDiffEq
using CUDA
@show ARGS
#settings
numberOfParameters = isinteractive() ? 768000 : parse(Int64, ARGS[1])
gpuID = 0
device!(CuDevice(gpuID))
println("Running on "*string(CuDevice(gpuID)))
function lorenz(u, p, t)
du1 = 10.0f0 * (u[2] - u[1])
du2 = p*u[1] - u[2] - u[1]*u[3]
du3 = u[1]*u[2] - 2.666f0 * u[3]
return @SVector [du1, du2, du3]
end
u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 1.0f0)
p = 28f0
prob = ODEProblem(lorenz, u0, tspan, p)
parameterList = range(0.0f0,stop = 21.0f0,length=numberOfParameters)
lorenzProblem = ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob, i, repeat) -> remake(prob, p = parameterList[i])
ensembleProb = EnsembleProblem(lorenzProblem,prob_func = prob_func)
## Building problems here only
I = 1:numberOfParameters
if ensembleProb.safetycopy
probs = map(I) do i
ensembleProb.prob_func(deepcopy(ensembleProb.prob), i, 1)
end
else
probs = map(I) do i
ensembleProb.prob_func(ensembleProb.prob, i, 1)
end
end
## Make them compatible with CUDA
probs = cu(probs)
@info "Solving the problem"
data = @benchmark @CUDA.sync DiffEqGPU.vectorized_solve($probs, $ensembleProb.prob, GPUSimpleTsit5();
save_everystep = false, dt = 0.001f0)
println("Parameter number: "*string(numberOfParameters))
println("Minimum time: "*string(minimum(data.times)/1e6)*" ms")
println("Allocs: "*string(data.allocs))
data = @benchmark @CUDA.sync DiffEqGPU.vectorized_asolve($probs, $ensembleProb.prob, GPUSimpleATsit5();
dt = 0.001f0,reltol = 1f-8,abstol=1f-8)
println("Parameter number: "*string(numberOfParameters))
println("Minimum time: "*string(minimum(data.times)/1e6)*" ms")
println("Allocs: "*string(data.allocs))
julia> data = @benchmark @CUDA.sync DiffEqGPU.vectorized_solve($probs, $ensembleProb.prob, GPUSimpleTsit5();
save_everystep = false, dt = 0.001f0)
BenchmarkTools.Trial: 295 samples with 1 evaluation.
Range (min … max): 14.925 ms … 36.507 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 15.416 ms ┊ GC (median): 0.00%
Time (mean ± σ): 16.955 ms ± 4.546 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▇█▅
████▇▅▅█▄▄▄▁▄▁▁▁▄▁▄▄▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▄▁▄▅▆▅▅▆▄▁▄▁▁▄▁▄▁▄▅▁▁▄ ▅
14.9 ms Histogram: log(frequency) by time 35.2 ms <
Memory estimate: 5.16 KiB, allocs estimate: 107.
julia> println("Parameter number: "*string(numberOfParameters))
Parameter number: 768000
julia> println("Minimum time: "*string(minimum(data.times)/1e6)*" ms")
Minimum time: 14.925271 ms
julia> println("Allocs: "*string(data.allocs))
Allocs: 107
julia> data = @benchmark @CUDA.sync DiffEqGPU.vectorized_asolve($probs, $ensembleProb.prob, GPUSimpleATsit5();
dt = 0.001f0,reltol = 1f-8,abstol=1f-8)
BenchmarkTools.Trial: 560 samples with 1 evaluation.
Range (min … max): 7.059 ms … 42.164 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 7.723 ms ┊ GC (median): 0.00%
Time (mean ± σ): 8.921 ms ± 4.485 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▅█▆▂
████▇▅▇▅▅▁▄▄▄▄▅▁▅▄▁▄▄▁▁▁▁▅▁▁▁▁▁▁▁▆▆▁▅▁▁▄▄▁▁▁▁▄▄▁▄▄▄▄▁▅▁▄▄▄ ▆
7.06 ms Histogram: log(frequency) by time 29.1 ms <
Memory estimate: 4.80 KiB, allocs estimate: 97.
RKCK45 is a Cash-Karp method, most comparable to Tsit5, not to RK4.
Is minimum time the right measurement here? IMO throughput matters a lot more when you're on a GPU.
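If throughput is the better metric, it is easy to derive from the same BenchmarkTools data (names follow the scripts above):
# Trajectories per second from the BenchmarkTools samples; data.times is in ns.
mean_s = sum(data.times) / length(data.times) / 1e9
println("Throughput: ", numberOfParameters / mean_s, " trajectories/s")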
Superseded by https://github.com/utkarsh530/GPUODEBenchmarks