Performance on different computing platforms
ANN training time (commit 61c1923), using the learning schedule

```julia
learn_schedule = [
    (batchsize = 1000, optimizer = GradientDecent(0.1), epochs = 1),
    (batchsize = 5000, optimizer = GradientDecent(0.05), epochs = 1),
    (batchsize = 50000, optimizer = GradientDecent(0.025), epochs = 1),
]
```
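For context, here's a rough sketch of how a schedule like this might be consumed. The `train!` call, `model`, and `data` are placeholders for illustration, not the actual API of ml_from_scratch.jl:

```julia
# Hypothetical driver loop: each schedule stage runs `epochs` epochs with its
# own batch size and optimizer. `train!`, `model` and `data` are assumed names.
for stage in learn_schedule
    for _ in 1:stage.epochs
        train!(model, data;
               batchsize = stage.batchsize,
               optimizer = stage.optimizer)
    end
end
```

If each stage's single epoch is one full pass over the training data, three stages and 10,500,000 total evaluations would imply roughly 3.5 million training samples.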
10,500,000 ANN input evaluations in total:
- Intel i9-9880 laptop CPU, 8 threads: 52 s
- AMD EPYC 7702P CPU: 43 s
- NVIDIA GeForce GTX 1650 (mobile), CUDA.jl v4.1.3: 9 s
- NVIDIA A100-PCIE-40GB, CUDA.jl v4.1.3: 9 s
- AMD MI210, AMDGPU.jl v0.4.8: 63 s
- Intel UHD Graphics 630 (mobile), oneAPI.jl v1.2.0: Extremely slow, ANN training stalled half-way (memory management issue?)
- Apple M1 GPU (Mac mini), Metal.jl v0.3.0: Extremely slow, ANN training stalled half-way (memory management issue?)
At some point during the Metal run, 29.45 GB of memory and 71.14 GB of swap were in use, with the Julia process itself using 90 GB of memory.
I also hit the error below, which I mentioned on Slack; I hacked around it by adapting the model and `x` to be on the CPU.
Error:
```julia
julia> Y_train_v = Array(vec(batched_eval(m_trained, X_train)))
ERROR: ArgumentError: cannot take the CPU address of a MtlMatrix{Float32}
Stacktrace:
[1] unsafe_convert(#unused#::Type{Ptr{Float32}}, x::MtlMatrix{Float32})
@ Metal ~/.julia/packages/Metal/TtPHW/src/array.jl:121
[2] gemm!(transA::Char, transB::Char, alpha::Float32, A::MtlMatrix{Float32}, B::SubArray{Float32, 2, MtlMatrix{Float32}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}, beta::Float32, C::MtlMatrix{Float32})
@ LinearAlgebra.BLAS ~/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.8/LinearAlgebra/src/blas.jl:1514
[3] gemm_wrapper!(C::MtlMatrix{Float32}, tA::Char, tB::Char, A::MtlMatrix{Float32}, B::SubArray{Float32, 2, MtlMatrix{Float32}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}, _add::LinearAlgebra.MulAddMul{true, true, Bool, Bool})
@ LinearAlgebra ~/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:674
[4] mul!
@ ~/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:161 [inlined]
[5] mul!
@ ~/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:276 [inlined]
[6] *
@ ~/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:148 [inlined]
[7] Fix1
@ ./operators.jl:1096 [inlined]
[8] (::ComposedFunction{ComposedFunction{ComposedFunction{ComposedFunction{ComposedFunction{BroadcastFunction{typeof(logistic)}, ComposedFunction{Fix1{BroadcastFunction{typeof(+)}, MtlVector{Float32}}, Fix1{typeof(*), MtlMatrix{Float32}}}}, BroadcastFunction{typeof(relu)}}, ComposedFunction{Fix1{BroadcastFunction{typeof(+)}, MtlVector{Float32}}, Fix1{typeof(*), MtlMatrix{Float32}}}}, BroadcastFunction{typeof(relu)}}, ComposedFunction{Fix1{BroadcastFunction{typeof(+)}, MtlVector{Float32}}, Fix1{typeof(*), MtlMatrix{Float32}}}})(x::SubArray{Float32, 2, MtlMatrix{Float32}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}; kw::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Base ./operators.jl:1035
[9] ComposedFunction
@ ./operators.jl:1033 [inlined]
[10] batched_eval(m::ComposedFunction{ComposedFunction{ComposedFunction{ComposedFunction{ComposedFunction{BroadcastFunction{typeof(logistic)}, ComposedFunction{Fix1{BroadcastFunction{typeof(+)}, MtlVector{Float32}}, Fix1{typeof(*), MtlMatrix{Float32}}}}, BroadcastFunction{typeof(relu)}}, ComposedFunction{Fix1{BroadcastFunction{typeof(+)}, MtlVector{Float32}}, Fix1{typeof(*), MtlMatrix{Float32}}}}, BroadcastFunction{typeof(relu)}}, ComposedFunction{Fix1{BroadcastFunction{typeof(+)}, MtlVector{Float32}}, Fix1{typeof(*), MtlMatrix{Float32}}}}, X::MtlMatrix{Float32}; batchsize::Int64)
@ Main ~/julia-ml-from-scratch/ml_from_scratch.jl:452
[11] batched_eval(m::Function, X::MtlMatrix{Float32})
@ Main ~/julia-ml-from-scratch/ml_from_scratch.jl:447
[12] top-level scope
@ REPL[14]:1
[13] top-level scope
@ ~/.julia/packages/Metal/TtPHW/src/initialization.jl:46
```
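The trace suggests that the `MtlMatrix * SubArray` product in frame [6] dispatches to LinearAlgebra's CPU BLAS `gemm!`, which needs a raw CPU pointer and fails on the Metal array. A sketch of the CPU workaround I used, via Adapt.jl's `adapt` (assuming its adaptation rules cover the composed model; `m_trained` and `X_train` are the variables from the session above):

```julia
using Adapt  # Adapt.jl

# Workaround: convert model parameters and input back to plain CPU Arrays
# before evaluating. `batched_eval` is the function from ml_from_scratch.jl.
m_cpu = adapt(Array, m_trained)
X_cpu = adapt(Array, X_train)
Y_train_v = Array(vec(batched_eval(m_cpu, X_cpu)))
```

An alternative, untested sketch would be to materialize each batch view inside `batched_eval` (e.g. `m(copy(view(X, :, batch_range)))`, with `batch_range` as an illustrative name), so that both matmul operands are plain `MtlMatrix` and Metal's own GEMM path can apply instead of the CPU BLAS fallback.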
@ViralBShah I assume the Metal.jl issue will be opened in the Metal.jl repo.
@maleadt , should I open a Metal.jl issue as well?
As mentioned on Slack I normally prefer more specific/concrete issues -- less actionable ones tend to get forgotten -- but we can always use a tracking issue on the Metal.jl repo, sure.
Yep, I'm afraid I don't have anything more specific at this point; I haven't done any in-depth profiling (and I'm definitely not a Metal.jl expert). Maybe @christiangnrd can open some more concrete issues. Do you plan to investigate this a bit more on Apple Silicon, @christiangnrd?
I'll try to investigate a bit but I'm very new to GPU programming so I might only be able to help with surface-level things.
No worries, I'm sure any kind of contribution will be very much appreciated!
The demo now runs through on Metal on a recent MacBook, so I'm closing this. I'll try to do a proper runtime comparison across recent versions of the various GPU backends again sometime.