Low performance of DSYEVD
Madu86 opened this issue · 4 comments
Hello, I am currently profiling a quantum chemistry code that I recently hipified. The algorithm involves iteratively diagonalizing a symmetric matrix, and for this purpose I hooked up DSYEVD from rocSOLVER. From the initial timings, I notice that the diagonalization is very slow and in fact dominates the overall runtime. See the two snapshots below, where I compare a computation (involving 20 iterations) on an MI100 with the same computation on an NVIDIA V100. For the latter, we use DnDsyevd from cuSOLVER. As is apparent from column 8 (DIAG_TIME) of the tables, the rocSOLVER DSYEVD diagonalization is surprisingly slow. Can someone help me figure out what's happening here?
cusolver_dndsyevd on V100
I have attached a test program where I hooked up the same Fortran drivers; it diagonalizes the same matrix as in the first iteration of the above example.
To compile and run the example, simply run make and then make run. This should print the diagonalization time with and without memory operations ("Time rocDIAG" and "Time rocsolver_dsyevd" respectively). Hope this will be useful. Thanks!
roctest.tar.gz
I enabled profile logging and ran your benchmark on a Radeon VII (with the current rocSOLVER / rocBLAS develop branch).
cgmb@8958-ubuntu-18.04-stg1:~/roctest$ ROCSOLVER_LAYER=4 ROCSOLVER_LEVELS=10 ./roctest
Time rocsolver_dsyevd (s): 19.062556999999998
------- PROFILE -------
rocsolver_syevd_heevd_template: Calls: 1, Total Time: 19068.755 ms (in nested functions: 19065.926 ms)
rocsolver_ormtr_unmtr_template: Calls: 1, Total Time: 57.516 ms (in nested functions: 57.513 ms)
rocsolver_ormqr_unmqr_template: Calls: 1, Total Time: 57.513 ms (in nested functions: 57.497 ms)
rocsolver_larfb_template: Calls: 12, Total Time: 36.826 ms (in nested functions: 36.045 ms)
rocblas_gemm_template: Calls: 22, Total Time: 1.338 ms
rocblas_trmm_template: Calls: 36, Total Time: 34.707 ms
rocsolver_larft_template: Calls: 12, Total Time: 20.671 ms (in nested functions: 19.198 ms)
rocblas_trmv_template: Calls: 341, Total Time: 13.964 ms
rocblas_gemv_template: Calls: 341, Total Time: 5.234 ms
rocsolver_sytrd_hetrd_template: Calls: 1, Total Time: 121.737 ms (in nested functions: 121.715 ms)
rocsolver_sytd2_hetd2_template: Calls: 1, Total Time: 6.137 ms (in nested functions: 4.593 ms)
rocblas_dot_template: Calls: 33, Total Time: 0.459 ms
rocblas_syr2_template: Calls: 33, Total Time: 2.277 ms
rocsolver_larfg_template: Calls: 33, Total Time: 1.010 ms (in nested functions: 0.906 ms)
rocblas_scal_template: Calls: 32, Total Time: 0.457 ms
rocblas_dot_template: Calls: 32, Total Time: 0.449 ms
rocblas_symv_template: Calls: 33, Total Time: 0.847 ms
rocsolver_latrd_template: Calls: 10, Total Time: 108.501 ms (in nested functions: 105.332 ms)
rocblas_dot_template: Calls: 320, Total Time: 4.405 ms
rocblas_scal_template: Calls: 320, Total Time: 3.951 ms
rocblas_symv_template: Calls: 320, Total Time: 19.856 ms
rocblas_gemv_template: Calls: 1920, Total Time: 54.097 ms
rocsolver_larfg_template: Calls: 320, Total Time: 23.023 ms (in nested functions: 21.628 ms)
rocblas_scal_template: Calls: 320, Total Time: 8.426 ms
rocblas_dot_template: Calls: 320, Total Time: 13.202 ms
rocblas_syr2k_template: Calls: 10, Total Time: 7.077 ms
rocsolver_stedc_template: Calls: 1, Total Time: 18886.673 ms
Time rocDIAG (s): 19.320864000000000
Enabling kernel logging (ROCSOLVER_LAYER=0x14), it seems that roughly 99% of the runtime is spent in stedc_kernel.
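As a rough cross-check, the profile output above points the same way. Below is a minimal Python sketch that computes the share of the SYEVD call spent in each of its direct children. The helper names parse_profile and shares are made up for illustration. It assumes that ormtr, sytrd and stedc are the three direct children of syevd, which is consistent with their times summing to the reported nested total of 19065.926 ms.

```python
import re

# Top-level entries copied from the profile log above (times in ms).
# Assumption: these three functions are the direct children of syevd.
PROFILE = """\
rocsolver_syevd_heevd_template: Calls: 1, Total Time: 19068.755 ms (in nested functions: 19065.926 ms)
rocsolver_ormtr_unmtr_template: Calls: 1, Total Time: 57.516 ms (in nested functions: 57.513 ms)
rocsolver_sytrd_hetrd_template: Calls: 1, Total Time: 121.737 ms (in nested functions: 121.715 ms)
rocsolver_stedc_template: Calls: 1, Total Time: 18886.673 ms
"""

# Matches "<name>: Calls: <n>, Total Time: <t> ms" at the start of a line.
LINE_RE = re.compile(r"^(\w+): Calls: (\d+), Total Time: ([\d.]+) ms")


def parse_profile(text):
    """Return {function_name: total_time_ms} for each profile line.

    Names are used as keys, so this only suits entries that appear once
    (the full log repeats e.g. rocblas_dot_template at several levels).
    """
    times = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            times[match.group(1)] = float(match.group(3))
    return times


def shares(times, parent):
    """Fraction of the parent's total time spent in each other entry."""
    total = times[parent]
    return {name: t / total for name, t in times.items() if name != parent}


if __name__ == "__main__":
    times = parse_profile(PROFILE)
    result = shares(times, "rocsolver_syevd_heevd_template")
    # Print the children from most to least expensive.
    for name, frac in sorted(result.items(), key=lambda kv: -kv[1]):
        print(f"{name:35s} {100 * frac:6.2f} %")
```

With these numbers, stedc accounts for about 99% of the SYEVD call, while the tridiagonal reduction (sytrd) and the back-transformation (ormtr) together take well under 1%.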
ROCm 4.5 is the first release that includes DSYEVD, and it hasn't seen significant optimization yet. I see that the comment above STEDC notes this is a very basic implementation that will only effectively utilize the GPU for batches of matrices.
We will definitely need to optimize this.
@cgmb I see, thanks for the explanation. I will temporarily disable rocSOLVER DSYEVD in my code and use a host diagonalizer. I look forward to an optimized version in a future ROCm release.
A resolution is implemented and will be available in the upcoming ROCm 5.5 release.