ROCm/Tensile

hipblasDgemm not getting close to peak

JorgeG94 opened this issue · 6 comments

What is the expected behavior

  • I would expect a dgemm on sufficiently large inputs to achieve close to the 47.9 TFLOP/s peak.

What actually happens

How to reproduce

Environment

Hardware description
GPU: MI250X
CPU: AMD Optimized 3rd Gen EPYC
Software version
ROCm: v5.4.0

I've tried larger sizes, and at some point the code simply fails, without ever exceeding 40 TFLOP/s.
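For anyone comparing numbers in this thread, here is a minimal sketch of how achieved dgemm throughput is usually derived from a timing, assuming the conventional 2*m*n*k operation count. The function names and the example timing are hypothetical and are not taken from the reproducer; the 47.9 TFLOP/s default is simply the peak figure quoted above.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved throughput in TFLOP/s for C = alpha*A*B + beta*C.

    Uses the conventional 2*m*n*k count (one multiply and one add per
    inner-product term); the alpha/beta scaling work is ignored.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * m * n * k / seconds / 1e12


def fraction_of_peak(achieved_tflops: float, peak_tflops: float = 47.9) -> float:
    """Ratio of achieved to peak throughput; the default peak is the figure quoted in this issue."""
    return achieved_tflops / peak_tflops


# Hypothetical timing, for illustration only (not a measured result).
achieved = gemm_tflops(16384, 16384, 16384, seconds=0.25)
print(f"{achieved:.1f} TFLOP/s, {100 * fraction_of_peak(achieved):.0f}% of quoted peak")
```

The timing passed in should cover only the GEMM call itself, averaged over several iterations after a warm-up call, so that one-time initialization cost does not lower the reported number.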

Hi @JorgeG94, thanks for opening this issue.

hipBLAS is just a wrapper library over the rocBLAS and cuBLAS backends. On AMD hardware, rocBLAS in turn uses the Tensile library for its gemm calls. Since you're looking for better dgemm performance, I think it's best to transfer this issue to the Tensile repository, where they can hopefully help you out. Any performance tuning done there will carry over to rocBLAS and to hipBLAS with the AMD backend.

Thanks,
Daine

I will check this on my side.
Does the performance drop happen only with this size?
Have you checked other sizes and/or orientations?
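To make the size and orientation questions concrete, here is a rough sketch of how a sweep could be laid out. It only enumerates test cases and their FP64 memory footprint; it does not call hipBLAS. The helper names are hypothetical, and the memory limit is an assumed parameter that should be set to whatever the device actually has free.

```python
from itertools import product

BYTES_PER_DOUBLE = 8


def dgemm_footprint_bytes(m: int, n: int, k: int) -> int:
    """Bytes needed for dense FP64 A (m x k), B (k x n), and C (m x n)."""
    return BYTES_PER_DOUBLE * (m * k + k * n + m * n)


def build_sweep(sizes, memory_limit_bytes):
    """Enumerate square problems in all four N/T orientations that fit in memory.

    memory_limit_bytes is an assumption supplied by the caller, not a
    value queried from the device.
    """
    orientations = list(product("NT", repeat=2))  # NN, NT, TN, TT
    cases = []
    for size in sizes:
        footprint = dgemm_footprint_bytes(size, size, size)
        if footprint > memory_limit_bytes:
            continue  # skip problems that cannot fit in the assumed limit
        for trans_a, trans_b in orientations:
            cases.append((trans_a, trans_b, size, footprint))
    return cases


# Illustrative limit of 32 GiB; adjust to the memory actually available.
for trans_a, trans_b, size, footprint in build_sweep([8192, 16384, 32768], 32 * 2**30):
    print(f"{trans_a}{trans_b} m=n=k={size} footprint={footprint / 2**30:.1f} GiB")
```

Reporting the achieved TFLOP/s for each case in such a list would show whether the gap is specific to one size or orientation, and the footprint column helps separate out-of-memory failures from genuine performance issues.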

@JorgeG94 Can you please test with the latest ROCm 6.1.2? If your issue is resolved, please close the ticket. Thanks!