hipblasdgemm not getting close to peak

Question

JorgeG94 opened this issue 2 years ago · 6 comments

JorgeG94 commented 2 years ago

What is the expected behavior

I would expect a DGEMM with a sizeable input to achieve close to the peak of 47.9 TFLOP/s.

What actually happens

Using the code provided at https://github.com/JorgeG94/calum_performance_tool, rocm/5.4.0 achieves only 38 TFLOP/s.

How to reproduce

The repo https://github.com/JorgeG94/calum_performance_tool has a README with the details, but basically:
hipcc -L/opt/rocm-5.4.3/lib -lhipblas --offload-arch=gfx90a performance.cpp
./a.out 36000 14400 36000 10 T T

Environment

Hardware	description
GPU	MI250x
CPU	AMD Optimized 3rd Gen EPYC

Software	version
ROCM	v5.4.0

Answer 1 · 2023-04-06T00:34:06.000Z

I've tried larger sizes, and at some point the code just breaks, without ever passing the 40 TFLOP/s barrier.

Answer 2 · 2023-04-06T15:28:15.000Z

Hi @JorgeG94, thanks for opening this issue.

hipBLAS is just a wrapper library for the rocBLAS/cuBLAS backends. rocBLAS in turn uses the Tensile library for calls to gemm. Since you're looking for better performance in dgemm, I think it will be best if I transfer this issue to the Tensile library, where they can hopefully help you out. Performance tuning done there will be realized in rocBLAS and in hipBLAS with the AMD backend.

Thanks,
Daine

Answer 3 · 2023-04-11T17:09:00.000Z

I will check this on my side.
Does the performance drop happen only with this size?
Have you checked other sizes and/or orientations?

Answer 4 · 2024-07-15T20:26:54.000Z

@JorgeG94 Can you please test with the latest ROCm 6.1.2? If your issue is resolved, please close the ticket. Thanks!