Low performance of xPOTRF.
rasolca opened this issue · 4 comments
The Cholesky decomposition doesn't perform well on an MI50.
With a matrix of size 10240, double-precision performance is only about 150 GFlop/s, and increasing the matrix size to 20480 makes it drop further to about 110 GFlop/s.
Full output of https://gist.github.com/rasolca/8a302639a75f79bfb3f767a2b4ab3014:
iteration: 0, size: 10240
Perf: 59.6001 GFlop/s
iteration: 1, size: 10240
Perf: 159.634 GFlop/s
iteration: 2, size: 10240
Perf: 159.479 GFlop/s
iteration: 3, size: 10240
Perf: 158.875 GFlop/s
iteration: 4, size: 10240
Perf: 159.327 GFlop/s
iteration: 0, size: 20480
Perf: 110.742 GFlop/s
iteration: 1, size: 20480
Perf: 110.517 GFlop/s
iteration: 2, size: 20480
Perf: 109.787 GFlop/s
iteration: 3, size: 20480
Perf: 109.32 GFlop/s
iteration: 4, size: 20480
Perf: 109.335 GFlop/s
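To put these numbers in context, here is a rough sketch that converts a reported GFlop/s figure back into elapsed time and into a percentage of peak. It assumes the benchmark uses the usual leading-order n^3/3 flop count for Cholesky (the gist may count differently), and the MI50 double-precision peak of about 6.6 TFlop/s is an assumed vendor figure, not something measured in this thread. All function names are hypothetical.

```python
def cholesky_flops(n):
    """Leading-order flop count for factoring an n x n matrix (assumed n^3 / 3)."""
    return n ** 3 / 3.0


def elapsed_seconds(n, gflops):
    """Time implied by a reported rate, given the assumed flop count."""
    return cholesky_flops(n) / (gflops * 1e9)


def percent_of_peak(gflops, peak_gflops):
    """Fraction of the device peak that a measured rate represents, in percent."""
    return 100.0 * gflops / peak_gflops


# Assumption: approximate FP64 vendor peak for an MI50, in GFlop/s.
MI50_PEAK_FP64_GFLOPS = 6600.0

# Steady-state rates taken from the log above (iterations 1 to 4).
for n, gflops in [(10240, 159.3), (20480, 109.3)]:
    t = elapsed_seconds(n, gflops)
    pct = percent_of_peak(gflops, MI50_PEAK_FP64_GFLOPS)
    print(f"n={n}: ~{t:.1f} s per factorization, ~{pct:.1f}% of assumed peak")
```

Under these assumptions both sizes sit at a few percent of peak.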
For comparison, an Nvidia P100 reaches ~70% of peak performance with a 10240 matrix using cuSolver.
The poor performance is likely due to xPOTF2, which is implemented with many level-2 BLAS kernel calls.
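To illustrate the distinction, here is a minimal pure-Python sketch of an unblocked, POTF2-style factorization next to a blocked, POTRF-style one. It is not rocSOLVER's code: the names `potf2` and `potrf_blocked`, the block size, and the right-looking variant are all assumptions chosen for illustration. In the unblocked routine every column is a dot product plus a matrix-vector style update. In the blocked routine most of the work moves into the triangular solve and the trailing update, which map onto level-3 BLAS (TRSM and SYRK/GEMM).

```python
import math


def potf2(a, lo, hi):
    """Unblocked lower Cholesky of the diagonal block a[lo:hi][lo:hi], in place."""
    for j in range(lo, hi):
        # Dot product over the already-factored part of row j.
        d = a[j][j] - sum(a[j][k] * a[j][k] for k in range(lo, j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        a[j][j] = math.sqrt(d)
        # Matrix-vector style update of the column below the diagonal.
        for i in range(j + 1, hi):
            s = sum(a[i][k] * a[j][k] for k in range(lo, j))
            a[i][j] = (a[i][j] - s) / a[j][j]


def potrf_blocked(a, nb):
    """Blocked right-looking lower Cholesky, in place; nb is the block size."""
    n = len(a)
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # 1. Factor the small diagonal block with the unblocked routine.
        potf2(a, k, e)
        # 2. TRSM-like step: panel <- panel * inverse(transpose(L_kk)).
        for i in range(e, n):
            for j in range(k, e):
                s = sum(a[i][p] * a[j][p] for p in range(k, j))
                a[i][j] = (a[i][j] - s) / a[j][j]
        # 3. SYRK-like step: trailing block -= panel * transpose(panel),
        #    touching the lower triangle only.
        for i in range(e, n):
            for j in range(e, i + 1):
                a[i][j] -= sum(a[i][p] * a[j][p] for p in range(k, e))
    return a
```

Both routines overwrite the lower triangle with L and leave the strict upper triangle untouched. With a block size much smaller than the matrix size, the unblocked step handles only a small fraction of the flops, so its efficiency matters less.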
Thanks for reaching out and opening this issue. rocSOLVER is under active development; the team first adds new functionality as purely functional API versions, and then optimizes the code according to priorities and users' requests. The optimization of POTRF is already on our radar; we will work on it and get back to you soon.
@rasolca Can you please test with the latest ROCm 6.1.2? If the issue is resolved, please close the ticket. Thanks!
After a quick test, I confirm a 5x performance improvement with ROCm 6.1.2 compared to ROCm 5.2.3.