omlins/ParallelStencil.jl

Metal backend?

Closed this issue · 14 comments

Hi,

Considering that Apple GPUs are the only devices offering very large, fast RAM (up to 128 GB) at a (relatively) affordable price,
I wonder about the amount of work that would be necessary to add a metal backend to PS.

I also wonder whether the best way to achieve that would be to use a KernelAbstractions.jl backend or to extend the current PS collection of backends.

Any thoughts?

Laurent

Metal support has recently been merged in #175. It should be available in the next release.

Great! I have just tested it (on dev) and it seems to run OK!
The performance is a bit lower than what I expected (it seems to improve with the size of the problem).
I wonder about the use of FP64 literals in some FD macro definitions like

```julia
macro av(A) @expandargs(A); esc(:(($A[$ixi-1,$iyi-1] + $A[$ixi,$iyi-1] + $A[$ixi-1,$iyi] + $A[$ixi,$iyi])*0.25 )) end
```

Could this lead to costly conversions when dealing with FP32?

You could try manually writing the kernel with `@parallel_indices` instead, explicitly writing `(A[ixi-1,iyi-1] + A[ixi,iyi-1] + A[ixi-1,iyi] + A[ixi,iyi]) / 4`. Does this improve the performance?
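For illustration, a minimal sketch of such a manually written kernel (the kernel name, array sizes, and 2D Metal setup are assumptions for this example, not code from PS):

```julia
using ParallelStencil
using Metal  # assumes Apple hardware; swap for the Threads backend otherwise
@init_parallel_stencil(Metal, Float32, 2)

# 4-point average written out explicitly, with a plain integer divisor
# so no FP64 literal can sneak into the FP32 kernel.
@parallel_indices (ix,iy) function av2D!(B, A)
    B[ix,iy] = (A[ix,iy] + A[ix+1,iy] + A[ix,iy+1] + A[ix+1,iy+1]) / 4
    return
end

A = @rand(1024, 1024)
B = @zeros(1023, 1023)
@parallel (1:size(B,1), 1:size(B,2)) av2D!(B, A)
```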

I will try this!

> I wonder about the use of FP64 literals in some FD macro definitions [...] Could this lead to costly conversions when dealing with FP32?

While this seems to be hard coded, in reality it is not: literal types are converted to the right type before the kernel expression is given back to the user (at parse time). To simplify dealing with literal types, as a user you only need to distinguish between floats and integers: write, for example, 1.0 for floats and 1 for integers.
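For illustration, a minimal sketch of what this means in practice (the kernel here is hypothetical; the Threads backend is used so it runs on any machine):

```julia
using ParallelStencil
@init_parallel_stencil(Threads, Float32, 2)

# Per the explanation above, the 0.25 below is rewritten at parse time to the
# numbertype selected in @init_parallel_stencil (Float32 here), so the kernel
# performs no Float64 conversion.
@parallel_indices (ix,iy) function quarter!(B, A)
    B[ix,iy] = A[ix,iy] * 0.25
    return
end
```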

Yes, I confirm that the performance does not change with FP32 literals.
I obtain an effective bandwidth of 191 GB/s for a 512^3 problem (acoustic wave with a variable velocity field).
I was a bit disappointed because the bandwidth reaches 360 GB/s for a simple array copy on this machine (M1 Max).
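For reference, a sketch of how such an effective-bandwidth figure is typically computed, counting only the arrays that must be read or written once per time step (the read/write count and the timing below are made up for illustration):

```julia
# Hypothetical effective-throughput estimate (all numbers invented for illustration).
nx = ny = nz = 512
n_arrays = 3                                             # e.g. 2 reads + 1 write (assumed)
A_eff = n_arrays * nx * ny * nz * sizeof(Float32) / 1e9  # GB moved per time step
t_it  = 8.4e-3                                           # hypothetical seconds per time step
T_eff = A_eff / t_it                                     # effective throughput in GB/s (≈ 191 here)
```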

Thanks again for the Metal backend!

Which is about 53% of peak for your application. What's the percentage on, e.g., an Nvidia GPU? Maybe tuning the kernel launch parameters could help a bit on Metal? Although, 53% isn't that bad ;)

Yes, it is not bad for this size (512^3).
I do not remember what the percentage was on my Nvidia board (I will check).
The thing is, the multithreaded CPU is rather good on Apple silicon, so the GPU speed-up is less impressive than with an Intel CPU and an Nvidia GPU ;)

I do not remember how to tune the kernel launch parameters with PS.

> the multithreaded CPU is rather good on Apple silicon

Fair point!

> how to tune the kernel launch parameters with PS

From `?@parallel` in the REPL: `@parallel nblocks nthreads kernelcall` can be used to launch with tuned parameters. On Metal, the heuristic seems to be the same as for CUDA, i.e., the max threads per block is 256 and threads.x is 32, see

```julia
const NTHREADS_X_MAX = 32
```

The remaining thread and block sizes are then inferred from
```julia
function compute_nthreads(maxsize; nthreads_x_max=NTHREADS_X_MAX, nthreads_max=NTHREADS_MAX, flatdim=0) # This is a heuristic, which results in (32,8,1) threads, except if maxsize[1] < 32 or maxsize[2] < 8.
    maxsize = promote_maxsize(maxsize)
    nthreads_x = min(nthreads_x_max, (flatdim==1) ? 1 : maxsize[1])
    nthreads_y = min(ceil(Int,nthreads_max/nthreads_x), (flatdim==2) ? 1 : maxsize[2])
    nthreads_z = min(ceil(Int,nthreads_max/(nthreads_x*nthreads_y)), (flatdim==3) ? 1 : maxsize[3])
    return (nthreads_x, nthreads_y, nthreads_z)
end

function compute_nblocks(maxsize, nthreads)
    maxsize = promote_maxsize(maxsize)
    if !(isa(nthreads, Union{AbstractArray,Tuple}) && length(nthreads)==3) @ArgumentError("nthreads must be an Array or Tuple of size 3 (obtained: $nthreads; its type is: $(typeof(nthreads))).") end
    return ceil.(Int, maxsize./nthreads)
end
```

@omlins it may be interesting to have the per-backend heuristics defined in the extensions, now that we have those?

Thanks!
I will try to tune the parameters for smaller sizes. The performance is actually quite good, and the CPU counterpart is 5 times slower ;)

@LaurentPlagne: unfortunately, I don't have access to a Metal-capable GPU to test and tune myself.
@GiackAloZ: have you found any theoretical information or gained any experience that would be useful for defining the heuristics for the Metal backend?

@omlins: I can run the tests you provide on my machine if that helps.
@GiackAloZ: I enjoyed your presentation at JuliaCon 2024 a lot. Our simulations look similar (but simpler) to what you presented. Would you be available for a quick talk any time soon?

> have you found any theoretical information or gained any experience that would be useful for defining the heuristics for the Metal backend?

@omlins I have not dug deep enough into that as of now, but I could try to gather some info and then come up with a heuristic. I can also run the tests, of course.

> I enjoyed your presentation at JuliaCon 2024 a lot. Our simulations look similar (but simpler) to what you presented. Would you be available for a quick talk any time soon?

@LaurentPlagne Thanks a lot! Surely we can chat; contact me via the email address you can find on my GitHub profile page.