SciML/DiffEqGPU.jl

Performance example: Parameter-Parallelism with GPU Ensemble Methods

roflmaostc opened this issue · 3 comments

Hi,

I'm just copy-pasting the example:

# ╔═╡ 0094c9ce-0bf7-40ff-b6df-dcddf94828cd
using DiffEqGPU, OrdinaryDiffEq, StaticArrays, CUDA

# ╔═╡ 0c65ebbf-95b8-44eb-8789-8c03c29442d8
function lorenz(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

# ╔═╡ faf01799-9b68-463d-b439-504d78ebb2e6
u0 = @SVector [1.0f0; 0.0f0; 0.0f0]

# ╔═╡ fd96be2f-eab0-4cf9-9680-8af8aa5fc54f
tspan = (0.0f0, 10.0f0)

# ╔═╡ 64bb027c-bc7b-443d-a712-8c2c9bf577ac
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]

# ╔═╡ 4b77d5b2-2988-4c2a-ab51-a424a016d85a
prob = ODEProblem{false}(lorenz, u0, tspan, p)

# ╔═╡ f5c3ee38-59b5-4840-9cea-ea2c1a52c512
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)

# ╔═╡ 169bb61d-8a59-4ad1-9936-d594ce5eb2de
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

# ╔═╡ 097eeba0-3608-498f-9574-b257b6e9604a
@time sol2 = solve(monteprob, Tsit5(), trajectories = 100_000,
                  adaptive = false, dt = 0.1f0)

# ╔═╡ 5cccc317-125b-4c23-abcf-09292ab9c7e8
CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 100_000,
                  adaptive = false, dt = 0.1f0)

I don't see an improvement with the GPU version.

# CPU
  0.182350 seconds (14.10 M allocations: 1.822 GiB)
# GPU
 0.249538 seconds (3.44 M CPU allocations: 382.376 MiB) (3 GPU allocations: 126.038 MiB, 0.01% memmgmt time)

System:

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 5900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
  Threads: 12 on 24 virtual cores
Environment:
  JULIA_REVISE_WORKER_ONLY = 1

NVIDIA GeForce RTX 3060 (GPU 0)

Is there anything wrong or is this expected?

Best,

Felix

Your measurements look really weird. Are you sure that's what actually ran? Do it in the REPL, not a Pluto notebook:

using DiffEqGPU, OrdinaryDiffEq, StaticArrays

function lorenz(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem{false}(lorenz, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

@time sol2 = solve(monteprob, Tsit5(), trajectories = 100_000,
                    adaptive = false, dt = 0.1f0)
# 9.721686 seconds (13.30 M allocations: 1.775 GiB, 17.76% gc time)

DiffEqGPU.CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 10_000,
                  adaptive = false, dt = 0.1f0)

# 0.022129 seconds (255.12 k CPU allocations: 33.413 MiB) (3 GPU allocations: 12.604 MiB, 0.09% memmgmt time)

#=
julia> versioninfo()
Julia Version 1.9.0-beta3
Commit 24204a7344 (2023-01-18 07:20 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 1 on 32 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1
  JULIA_EDITOR = code
  JULIA_NUM_THREADS =
=#

with a 2080 Super. How your Ryzen 9 5900X could be an order of magnitude faster and your 3060 an order of magnitude slower is... weird, to say the least.

@time sol2 = solve(monteprob, Tsit5(), trajectories = 100_000,
DiffEqGPU.CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 10_000,

did you measure with different numbers of trajectories?
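
For an apples-to-apples check, something like this (just a sketch reusing `monteprob` from above; each backend is run once for compilation and only the second call is timed) would compare both at the same trajectory count:

ntraj = 100_000

# CPU ensemble: run once to compile, then time the second run
solve(monteprob, Tsit5(), trajectories = ntraj, adaptive = false, dt = 0.1f0)
@time solve(monteprob, Tsit5(), trajectories = ntraj, adaptive = false, dt = 0.1f0)

# GPU kernel ensemble at the same trajectory count
solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = ntraj,
      adaptive = false, dt = 0.1f0)
DiffEqGPU.CUDA.@time solve(monteprob, GPUTsit5(), EnsembleGPUKernel(),
                           trajectories = ntraj, adaptive = false, dt = 0.1f0)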

I've been using Pluto, but below are my REPL results.

As far as I can see, it uses 12 threads. Is my processor on steroids?

julia> @time sol2 = solve(monteprob, Tsit5(), trajectories = 100_000,
                           adaptive = false, dt = 0.1f0)
  0.199818 seconds (13.10 M allocations: 1.772 GiB)

julia> DiffEqGPU.CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 100_000,
                         adaptive = false, dt = 0.1f0)
  0.189582 seconds (2.44 M CPU allocations: 330.496 MiB) (3 GPU allocations: 126.038 MiB, 0.16% memmgmt time)
EnsembleSolution Solution of length 100000 with uType:
ODESolution{Float32, 2, uType, Nothing, Nothing, tType, Nothing, P, A, IType, Nothing, Nothing} where {uType, tType, P, A, IType}


julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 5900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
  Threads: 12 on 24 virtual cores

This could be a case where copying the GPU arrays back to the CPU to build the EnsembleSolution is more expensive than the GPU solve itself. Try using the lower-level API: https://docs.sciml.ai/DiffEqGPU/dev/tutorials/lower_level_api/
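
For reference, a rough sketch of that lower-level path, following the linked tutorial and reusing `prob` and `p` from the example above (`DiffEqGPU.vectorized_solve` and its keywords are taken from that tutorial and may differ between DiffEqGPU versions):

using DiffEqGPU, OrdinaryDiffEq, StaticArrays, CUDA

trajectories = 100_000

# Build one problem per parameter set, then move the whole batch to the GPU
probs = map(1:trajectories) do i
    remake(prob, p = (@SVector rand(Float32, 3)) .* p)
end
probs = cu(probs)

# Warm-up call for compilation, then time only the kernel solve.
# vectorized_solve returns raw GPU arrays of times and states, so no
# per-trajectory ODESolution objects are assembled on the CPU.
CUDA.@sync DiffEqGPU.vectorized_solve(probs, prob, GPUTsit5();
                                      save_everystep = false, dt = 0.1f0)
CUDA.@time ts, us = DiffEqGPU.vectorized_solve(probs, prob, GPUTsit5();
                                               save_everystep = false, dt = 0.1f0)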