odsl-team/julia-ml-from-scratch

Performance on different computing platforms

Closed this issue · 8 comments

ANN training times (at commit 61c1923):

learn_schedule = [
    (batchsize = 1000, optimizer = GradientDecent(0.1), epochs = 1),
    (batchsize = 5000, optimizer = GradientDecent(0.05), epochs = 1),
    (batchsize = 50000, optimizer = GradientDecent(0.025), epochs = 1),
]
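
For context, each schedule entry is just a NamedTuple, so a schedule like this could be consumed roughly as sketched below (not the repo's actual training loop; train_step! and Y_train are hypothetical placeholders):

for phase in learn_schedule
    for _ in 1:phase.epochs
        # split the training columns into mini-batches of the requested size
        for idxs in Iterators.partition(axes(X_train, 2), phase.batchsize)
            # one optimization step with this phase's optimizer and step size
            train_step!(m, view(X_train, :, idxs), view(Y_train, idxs), phase.optimizer)
        end
    end
end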

10,500,000 ANN input evaluations in total:

  • Intel i9-9880 laptop CPU, 8 threads: 52 s
  • AMD EPYC 7702P CPU: 43 s
  • NVIDIA GeForce GTX 1650 (mobile), CUDA.jl v4.1.3: 9 s
  • NVIDIA A100-PCIE-40GB, CUDA.jl v4.1.3: 9 s
  • AMD MI210, AMDGPU.jl v0.4.8: 63 s
  • Intel UHD Graphics 630 (mobile), oneAPI.jl v1.2.0: Extremely slow, ANN training stalled half-way (memory management issue?)
  • Apple M1 GPU (Mac mini), Metal.jl v0.3.0: Extremely slow, ANN training stalled half-way (memory management issue?)

At some point during the computation with Metal, 29.45 GB of memory and 71.14 GB of swap were in use, with the Julia process alone consuming 90 GB.

I also hit the error below, which I mentioned on Slack; I worked around it by adapting the model and x back to the CPU (see the sketch after the stack trace).

Error:

julia> Y_train_v = Array(vec(batched_eval(m_trained, X_train)))
ERROR: ArgumentError: cannot take the CPU address of a MtlMatrix{Float32}
Stacktrace:
  [1] unsafe_convert(#unused#::Type{Ptr{Float32}}, x::MtlMatrix{Float32})
    @ Metal ~/.julia/packages/Metal/TtPHW/src/array.jl:121
  [2] gemm!(transA::Char, transB::Char, alpha::Float32, A::MtlMatrix{Float32}, B::SubArray{Float32, 2, MtlMatrix{Float32}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}, beta::Float32, C::MtlMatrix{Float32})
    @ LinearAlgebra.BLAS ~/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.8/LinearAlgebra/src/blas.jl:1514
  [3] gemm_wrapper!(C::MtlMatrix{Float32}, tA::Char, tB::Char, A::MtlMatrix{Float32}, B::SubArray{Float32, 2, MtlMatrix{Float32}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}, _add::LinearAlgebra.MulAddMul{true, true, Bool, Bool})
    @ LinearAlgebra ~/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:674
  [4] mul!
    @ ~/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:161 [inlined]
  [5] mul!
    @ ~/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:276 [inlined]
  [6] *
    @ ~/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:148 [inlined]
  [7] Fix1
    @ ./operators.jl:1096 [inlined]
  [8] (::ComposedFunction{ComposedFunction{ComposedFunction{ComposedFunction{ComposedFunction{BroadcastFunction{typeof(logistic)}, ComposedFunction{Fix1{BroadcastFunction{typeof(+)}, MtlVector{Float32}}, Fix1{typeof(*), MtlMatrix{Float32}}}}, BroadcastFunction{typeof(relu)}}, ComposedFunction{Fix1{BroadcastFunction{typeof(+)}, MtlVector{Float32}}, Fix1{typeof(*), MtlMatrix{Float32}}}}, BroadcastFunction{typeof(relu)}}, ComposedFunction{Fix1{BroadcastFunction{typeof(+)}, MtlVector{Float32}}, Fix1{typeof(*), MtlMatrix{Float32}}}})(x::SubArray{Float32, 2, MtlMatrix{Float32}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}; kw::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Base ./operators.jl:1035
  [9] ComposedFunction
    @ ./operators.jl:1033 [inlined]
 [10] batched_eval(m::ComposedFunction{ComposedFunction{ComposedFunction{ComposedFunction{ComposedFunction{BroadcastFunction{typeof(logistic)}, ComposedFunction{Fix1{BroadcastFunction{typeof(+)}, MtlVector{Float32}}, Fix1{typeof(*), MtlMatrix{Float32}}}}, BroadcastFunction{typeof(relu)}}, ComposedFunction{Fix1{BroadcastFunction{typeof(+)}, MtlVector{Float32}}, Fix1{typeof(*), MtlMatrix{Float32}}}}, BroadcastFunction{typeof(relu)}}, ComposedFunction{Fix1{BroadcastFunction{typeof(+)}, MtlVector{Float32}}, Fix1{typeof(*), MtlMatrix{Float32}}}}, X::MtlMatrix{Float32}; batchsize::Int64)
    @ Main ~/julia-ml-from-scratch/ml_from_scratch.jl:452
 [11] batched_eval(m::Function, X::MtlMatrix{Float32})
    @ Main ~/julia-ml-from-scratch/ml_from_scratch.jl:447
 [12] top-level scope
    @ REPL[14]:1
 [13] top-level scope
    @ ~/.julia/packages/Metal/TtPHW/src/initialization.jl:46
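
For reference, the CPU workaround looked roughly like this (a minimal sketch, assuming the script's Adapt-based device handling also works in the CPU direction, as the phrase "adapting the model and x" suggests):

using Adapt

# move the trained model (a ComposedFunction holding MtlArrays) and the inputs back to the CPU;
# this assumes adapt rules cover the model's structure
m_cpu = adapt(Array, m_trained)
X_cpu = adapt(Array, X_train)

# evaluation then takes the plain CPU/BLAS code path instead of hitting the MtlMatrix pointer error
Y_train_v = Array(vec(batched_eval(m_cpu, X_cpu)))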

@ViralBShah I assume the Metal.jl issue will be opened in the Metal.jl repo.

@maleadt , should I open a Metal.jl issue as well?

As mentioned on Slack I normally prefer more specific/concrete issues -- less actionable ones tend to get forgotten -- but we can always use a tracking issue on the Metal.jl repo, sure.

Yep, I'm afraid I don't have anything more specific at this point, haven't done any in-depth profiling (and I'm really not a Metal.jl expert). Maybe @christiangnrd can open some more concrete issues - do you plan to investigate this a bit more on Apple Silicon, @christiangnrd?

I'll try to investigate a bit but I'm very new to GPU programming so I might only be able to help with surface-level things.

No worries, I'm sure any kind of contribution will be very much appreciated!

The demo now runs through on Metal on a recent MacBook, so I'm closing this. I'll try to do a proper runtime comparison across recent versions of the various GPU backends again sometime.