hipblasDgemm not getting close to peak
JorgeG94 opened this issue · 6 comments
What is the expected behavior
- I would expect a dgemm on a sizeable input to achieve close to the 47.9 TFLOP/s peak
What actually happens
- Using the code provided in https://github.com/JorgeG94/calum_performance_tool, rocm/5.4.0 reaches only 38 TFLOP/s
How to reproduce
- The README in the repo https://github.com/JorgeG94/calum_performance_tool has the details, but basically:
hipcc -L/opt/rocm-5.4.3/lib -lhipblas --offload-arch=gfx90a performance.cpp
./a.out 36000 14400 36000 10 T T
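For reference, the rate quoted above can be related to the problem size with the usual GEMM operation count of 2·M·N·K. The sketch below is not taken from the linked repo: the helper names `gemm_flops` and `achieved_tflops` are hypothetical, and it assumes the three leading arguments to `./a.out` are the matrix dimensions (their order does not affect the product).

```cpp
// Hypothetical helpers, not part of calum_performance_tool.

// FLOP count of one GEMM call: each of the m*n outputs takes k multiply-adds,
// counted as 2 floating-point operations each.
double gemm_flops(double m, double n, double k) {
    return 2.0 * m * n * k;
}

// Achieved rate in TFLOP/s, given the average wall time of one call in seconds.
double achieved_tflops(double m, double n, double k, double seconds_per_call) {
    return gemm_flops(m, n, k) / seconds_per_call / 1.0e12;
}
```

For the sizes in the command above, `gemm_flops(36000, 14400, 36000)` is about 3.73e13, so 38 TFLOP/s corresponds to roughly 0.98 s per call and to about 79% of the 47.9 TFLOP/s figure.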
Environment
| Hardware | Description |
|---|---|
| GPU | MI250x |
| CPU | AMD Optimized 3rd Gen EPYC |

| Software | Version |
|---|---|
| ROCM | v5.4.0 |
I've tried larger sizes; at some point the code just fails, and it never gets past the 40 TFLOP/s barrier.
Hi @JorgeG94, thanks for opening this issue.
hipBLAS is just a wrapper library for rocBLAS/cuBLAS backends. rocBLAS then uses the Tensile library for calls to gemm. Since you're looking for better performance in dgemm, I think it will be best if I transfer this issue to the Tensile library where they can hopefully help you out. Performance tuning done there will be realized in rocBLAS and hipBLAS w/ AMD backend.
Thanks,
Daine
I will check this on my side.
Does the performance drop happen only with this size?
Have you checked other sizes and/or orientations?
@JorgeG94 Can you please test with the latest ROCm 6.1.2? If your issue is resolved, please close the ticket. Thanks!