ROCm/rocSOLVER

Low performance of DSYEVD

Madu86 opened this issue · 4 comments

Hello, I am currently profiling a quantum chemistry code that I recently hipified. The algorithm iteratively diagonalizes a symmetric matrix, and for this purpose I hooked up DSYEVD from rocSOLVER. From the initial timings, I notice that the diagonalization is very slow and in fact kills the overall performance. See the two snapshots below, where I compare a computation (20 iterations) on an MI100 with the same computation on an NVIDIA V100. For the latter, we use DnDsyevd from cuSOLVER. As is apparent from column 8 (DIAG_TIME) of the tables, the rocSOLVER DSYEVD diagonalization is surprisingly slow. Can someone help me figure out what's happening here?

rocsolver_dsyevd on MI100
[Image: Screen Shot 2021-11-21 at 2 01 18 PM]

cusolver_dndsyevd on V100

[Image: Screen Shot 2021-11-21 at 2 00 54 PM]

I have attached a test program that uses the same Fortran drivers and diagonalizes the same matrix as the first iteration of the above example.
To compile and run the example, simply run make and make run. This should print the diagonalization time with and without memory operations ("Time rocDIAG" and "Time rocsolver_dsyevd" respectively). I hope this will be useful. Thanks!
roctest.tar.gz

cgmb commented

I enabled profile logging and ran your benchmark on a Radeon VII (with the current rocSOLVER / rocBLAS develop branch).

cgmb@8958-ubuntu-18.04-stg1:~/roctest$ ROCSOLVER_LAYER=4 ROCSOLVER_LEVELS=10 ./roctest 
 Time rocsolver_dsyevd (s):    19.062556999999998     
------- PROFILE -------
rocsolver_syevd_heevd_template: Calls: 1, Total Time: 19068.755 ms (in nested functions: 19065.926 ms)
    rocsolver_ormtr_unmtr_template: Calls: 1, Total Time: 57.516 ms (in nested functions: 57.513 ms)
        rocsolver_ormqr_unmqr_template: Calls: 1, Total Time: 57.513 ms (in nested functions: 57.497 ms)
            rocsolver_larfb_template: Calls: 12, Total Time: 36.826 ms (in nested functions: 36.045 ms)
                rocblas_gemm_template: Calls: 22, Total Time: 1.338 ms
                rocblas_trmm_template: Calls: 36, Total Time: 34.707 ms
            rocsolver_larft_template: Calls: 12, Total Time: 20.671 ms (in nested functions: 19.198 ms)
                rocblas_trmv_template: Calls: 341, Total Time: 13.964 ms
                rocblas_gemv_template: Calls: 341, Total Time: 5.234 ms
    rocsolver_sytrd_hetrd_template: Calls: 1, Total Time: 121.737 ms (in nested functions: 121.715 ms)
        rocsolver_sytd2_hetd2_template: Calls: 1, Total Time: 6.137 ms (in nested functions: 4.593 ms)
            rocblas_dot_template: Calls: 33, Total Time: 0.459 ms
            rocblas_syr2_template: Calls: 33, Total Time: 2.277 ms
            rocsolver_larfg_template: Calls: 33, Total Time: 1.010 ms (in nested functions: 0.906 ms)
                rocblas_scal_template: Calls: 32, Total Time: 0.457 ms
                rocblas_dot_template: Calls: 32, Total Time: 0.449 ms
            rocblas_symv_template: Calls: 33, Total Time: 0.847 ms
        rocsolver_latrd_template: Calls: 10, Total Time: 108.501 ms (in nested functions: 105.332 ms)
            rocblas_dot_template: Calls: 320, Total Time: 4.405 ms
            rocblas_scal_template: Calls: 320, Total Time: 3.951 ms
            rocblas_symv_template: Calls: 320, Total Time: 19.856 ms
            rocblas_gemv_template: Calls: 1920, Total Time: 54.097 ms
            rocsolver_larfg_template: Calls: 320, Total Time: 23.023 ms (in nested functions: 21.628 ms)
                rocblas_scal_template: Calls: 320, Total Time: 8.426 ms
                rocblas_dot_template: Calls: 320, Total Time: 13.202 ms
        rocblas_syr2k_template: Calls: 10, Total Time: 7.077 ms
    rocsolver_stedc_template: Calls: 1, Total Time: 18886.673 ms

 Time rocDIAG (s):   19.320864000000000     

cgmb commented

Enabling kernel logging (ROCSOLVER_LAYER=0x14), it seems that roughly 99% of the runtime is spent in stedc_kernel.
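
The profile posted earlier in this thread gives a way to cross-check that figure. The following is a minimal sketch, not part of rocSOLVER: the `parse_profile` helper and its regular expression are illustrative assumptions about the log layout, and the input is just the top-level entries copied from the profile output above. It computes each phase's share of the total `syevd` time, plus the difference between the two wall-clock timings printed by the test program.

```python
import re

# Top-level entries copied from the ROCSOLVER_LAYER=4 profile in this thread.
PROFILE = """\
rocsolver_syevd_heevd_template: Calls: 1, Total Time: 19068.755 ms (in nested functions: 19065.926 ms)
    rocsolver_ormtr_unmtr_template: Calls: 1, Total Time: 57.516 ms (in nested functions: 57.513 ms)
    rocsolver_sytrd_hetrd_template: Calls: 1, Total Time: 121.737 ms (in nested functions: 121.715 ms)
    rocsolver_stedc_template: Calls: 1, Total Time: 18886.673 ms
"""

# Assumed line layout: "<name>: Calls: <n>, Total Time: <t> ms ..."
ENTRY = re.compile(r"^\s*(\w+): Calls: \d+, Total Time: ([\d.]+) ms")


def parse_profile(text):
    """Map each profiled function name to its total time in milliseconds."""
    times = {}
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            times[match.group(1)] = float(match.group(2))
    return times


times = parse_profile(PROFILE)
total_ms = times.pop("rocsolver_syevd_heevd_template")

# Share of the total syevd time spent in each direct child.
shares = {name: 100.0 * ms / total_ms for name, ms in times.items()}
for name, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:32s} {pct:6.2f} % of syevd")

# Wall-clock times printed by the test program, in seconds:
# "Time rocDIAG" (with memory operations) and "Time rocsolver_dsyevd" (without).
time_with_memops_s = 19.320864
time_solver_only_s = 19.062557
memops_overhead_s = time_with_memops_s - time_solver_only_s
print(f"memory-operation overhead: {memops_overhead_s:.3f} s")
```

With these numbers, `rocsolver_stedc_template` accounts for roughly 99% of the `syevd` time, while the tridiagonal reduction (`sytrd`) and the back-transformation (`ormtr`) together take under 1%. The memory operations add only about a quarter of a second, so they are not the bottleneck.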

ROCm 4.5 is the first release that includes DSYEVD, and it hasn't seen significant optimization yet. I see that the comment above STEDC notes this is a very basic implementation that will only effectively utilize the GPU for batches of matrices.

We will definitely need to optimize this.

@cgmb I see, thanks for the explanation. I will temporarily disable rocSOLVER DSYEVD in my code and use a host diagonalizer instead. I look forward to an optimized version in a future ROCm release.

A fix has been implemented and will be available in the upcoming ROCm 5.5 release.