JuliaGPU/Metal.jl

Port `accumulate!` and `findall` from CUDA.jl


Hi there, this is a really great package. Thanks for your great efforts!

My question is:
Are there any plans to support functions like accumulate!, cumsum, and findall in Metal.jl, as in CUDA.jl?
Or could you point me to any resources that I can follow to write these functions in Metal.jl?

Thanks!

Hi zhenwu0728, thank you for your interest!

The Metal Performance Shaders library seems to provide functionality for cumsum. I have a local branch with the library wrappers implemented, but I won't be returning to it until I finish my thesis, so feel free to attempt it yourself. There are a few recent PRs that you could use as a template.

Another option for the cumsum and the others is to adapt the CUDA.jl implementations for Metal.jl.

Or, alternatively, port them to GPUArrays.jl which would make them available to all back-ends.

@maleadt Any ideas on when this will happen? I've seen similar code for accumulate in CUDA.jl and AMDGPU.jl.

Any ideas when this will happen?

Probably only after GPUArrays.jl migrates to KernelAbstractions.jl, which is still some weeks to months off. So feel free to take a stab at a native (i.e. in Metal.jl) implementation first if you require this functionality.
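
For anyone wanting to take a stab at that, here is a very rough, purely illustrative sketch of what a hand-written Metal.jl kernel for a row-wise cumsum (dims=2) could look like; the names cumsum2_kernel! and cumsum2_native are made up, and the launch configuration is only a guess:

using Metal

# Hypothetical kernel: each thread owns one row and scans it serially.
function cumsum2_kernel!(B, A)
    row = thread_position_in_grid_1d()
    if row <= size(A, 1)
        acc = zero(eltype(B))
        for k in 1:size(A, 2)
            @inbounds acc += A[row, k]
            @inbounds B[row, k] = acc
        end
    end
    return
end

function cumsum2_native(A::MtlMatrix{Float32})
    B = similar(A)
    nrows = size(A, 1)
    groupsize = min(nrows, 256)      # threads per threadgroup
    ngroups = cld(nrows, groupsize)  # enough groups to cover every row
    @metal threads=groupsize groups=ngroups cumsum2_kernel!(B, A)
    return B
end

# Usage: cumsum2_native(MtlArray(randn(Float32, 101, 8)))

A serial scan per row like this is only reasonable when there are many more rows than columns; a proper implementation would use a parallel scan within each threadgroup.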

If anyone else encounters this problem and really needs cumsum, I implemented a very simple version using GPUArrays.jl that does cumsum(x; dims=2) (the only version I needed, but it could easily be modified for dims=1; a sketch of that variant follows the snippet below):

# cumsum(x; dims=2), thanks to https://pde-on-gpu.vaw.ethz.ch/lecture10/
using GPUArrays

function cumsum2(A::AnyGPUMatrix{T}) where {T}
    B = similar(A)
    # One thread per row; each thread scans its row serially.
    gpu_call(B, A; name="cumsum!", elements=size(A, 1)) do ctx, B, A
        idx = @cartesianidx B
        i, j = Tuple(idx)  # only the row index i is used
        cur_val = zero(T)
        for k in 1:size(A, 2)
            @inbounds cur_val += A[i, k]
            @inbounds B[i, k] = cur_val
        end
        return
    end
    return B
end
# Potential improvements: use shared memory and tune block/grid sizes
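
For completeness, the dims=1 variant mentioned above could look roughly like this (a minimal sketch along the same lines, with the made-up name cumsum1): one thread per column, scanning down the rows.

# Hypothetical dims=1 variant: one thread per column, scanning down the rows.
function cumsum1(A::AnyGPUMatrix{T}) where {T}
    B = similar(A)
    gpu_call(B, A; name="cumsum!", elements=size(A, 2)) do ctx, B, A
        j = linear_index(ctx)
        j > size(A, 2) && return  # guard against extra threads in the launch
        cur_val = zero(T)
        for k in 1:size(A, 1)
            @inbounds cur_val += A[k, j]
            @inbounds B[k, j] = cur_val
        end
        return
    end
    return B
end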

At least in my tests, it seems to be comparable to CUDA's cumsum and multiple times faster than doing gpu(cumsum(cpu(x))).

# Benchmarking
using BenchmarkTools, CUDA, Statistics
# `gpu`/`cpu` here are assumed to be Flux's device-transfer helpers.
b1 = @benchmark CUDA.@sync cumsum2($A)                  # My cumsum
b2 = @benchmark CUDA.@sync cumsum($A; dims=2)           # CUDA
b3 = @benchmark CUDA.@sync gpu(cumsum(cpu($A); dims=2)) # Current workaround
speedup_compared_to_cuda = mean(b2.times) / mean(b1.times)
speedup_compared_to_current = mean(b3.times) / mean(b1.times)

# A = randn(Float32, 101, 8) |> gpu
# speedup_compared_to_cuda = 1.1598880188052891
# speedup_compared_to_current = 1.7993942371142466

# A = randn(Float32, 10_001, 800) |> gpu
# speedup_compared_to_cuda = 3.420802824285882
# speedup_compared_to_current = 89.42437204572116

This is also an issue for oneAPI.jl, but the GPUArrays.jl solution should apply to both.

Resolved by #377 and #382.

Can you register a new version of the package? I would like to use these two functions. Thanks!