CNugteren/CLBlast

Performance gemv vs gemm

JanAbbing opened this issue · 5 comments

Hello,

I encountered a weird runtime difference between the gemv and the gemm routine.
When I run both with the input M=4096, N=1, K=4096 on my GTX480, the gemm routine takes 3.04 ms and the gemv routine takes 5.51 ms. I would have expected gemv to be faster than gemm, because it is made for exactly this kind of input. Could it be that gemv isn't yet optimized for the GTX480, or is it normal that it is slower? For comparison, with cuBLAS it is the other way around: Sgemv is almost 2 times faster than Sgemm.

I call the gemv routine like this:
./clblast_client_xgemv -m 4096 -n 4096 -alpha 1 -beta 0 -warm_up true -runs 100

Greetings,
Jan

You are right, the GEMV kernel is not particularly fast if the matrix is rotated. It's been a while since I looked at it and I completely forgot about it. But you can see similar results if you look in the doc/performance folder of CLBlast. In fact, there is a GTX480 graph included as well. You can generate such a graph on your own system as well with the included tools (see README).

Also on my system with the latest version of CLBlast I see this behaviour:

./clblast_client_xgemv -m 4096 -n 4096 -alpha 1 -beta 0 -warm_up -layout 101
                                                                                                                         | <--       CLBlast       --> | <--       clBLAS        --> |
        m;        n;   layout;   transA;      lda;     incx;     incy;     offa;     offx;     offy;    alpha;     beta;     ms_1; GFLOPS_1;    GBs_1;     ms_2; GFLOPS_2;    GBs_2;  
       4K;       4K;      101;      111;       4K;        1;        1;        0;        0;        0; 1.000000; 0.000000;    20.28;      1.7;      3.3;     5.87;      5.7;     11.4;  

./clblast_client_xgemv -m 4096 -n 4096 -alpha 1 -beta 0 -warm_up -layout 102
                                                                                                                         | <--       CLBlast       --> | <--       clBLAS        --> |
        m;        n;   layout;   transA;      lda;     incx;     incy;     offa;     offx;     offy;    alpha;     beta;     ms_1; GFLOPS_1;    GBs_1;     ms_2; GFLOPS_2;    GBs_2;  
       4K;       4K;      102;      111;       4K;        1;        1;        0;        0;        0; 1.000000; 0.000000;     1.70;     19.7;     39.4;     2.77;     12.1;     24.3;  
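
As a rough cross-check, the GFLOPS and GB/s columns in the two runs above can be approximately reproduced from the measured times. This is only a sketch: it assumes single precision, 2*m*n floating-point operations, and one read of A and x plus one write of y. The client may count slightly differently, and the printed times are rounded, so small deviations in the last digit are expected.

```python
# Rough reconstruction of the GFLOPS and GB/s columns for a GEMV run.
# Assumptions (not taken from the CLBlast source): 2*m*n flops, and
# A (m*n), x (n) and y (m) each touched once at 4 bytes per element.

def gemv_metrics(m, n, ms, bytes_per_element=4):
    flops = 2 * m * n
    bytes_moved = (m * n + n + m) * bytes_per_element
    seconds = ms * 1e-3
    return flops / seconds / 1e9, bytes_moved / seconds / 1e9

for label, ms in [("layout 101", 20.28), ("layout 102", 1.70)]:
    gflops, gbs = gemv_metrics(4096, 4096, ms)
    print(f"{label}: {gflops:.1f} GFLOPS, {gbs:.1f} GB/s")
```

Since GEMV moves about 2 bytes per floating-point operation under these assumptions, the routine is bandwidth-bound, which is why the memory access pattern matters so much here.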

For now, you can get decent performance again if you rotate the matrix (either use column-major layout or set the transpose option). I'll take a more in-depth look at the kernel soon and try to improve it for rotated matrices. I'll keep you up-to-date.
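
To illustrate why the layout makes such a difference, here is a minimal sketch of the address arithmetic. It assumes a simple mapping in which work-item i computes y[i] and walks along row i of A. That mapping is a hypothetical simplification for illustration, not necessarily what the CLBlast kernel does.

```python
# Address of A[row][col] in a flat buffer for the two layouts
# (101 = row-major, 102 = column-major, as in the client output).
def address(row, col, ld, layout):
    return row * ld + col if layout == 101 else col * ld + row

def neighbour_stride(layout, ld=4096, k=0):
    # Hypothetical mapping: work-item i computes y[i] and, at step k of
    # its dot product, reads A[i][k]. The stride is the distance (in
    # elements) between the reads of work-items 0 and 1 at the same step.
    return address(1, k, ld, layout) - address(0, k, ld, layout)

print(neighbour_stride(101))  # row-major: neighbours are ld elements apart
print(neighbour_stride(102))  # column-major: neighbours are adjacent
```

With a stride of 1, neighbouring work-items read neighbouring addresses and the hardware can combine them into few memory transactions. With a stride of lda, each read lands in a different memory segment.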

I've designed a new kernel for the rotated case. It has much better data locality since it now loads a tile of matrix A into the local memory. This also enables coalescing. On my device this already improves performance to the clBLAS level (old and new experiments below each other for comparison):

./clblast_client_xgemv -m 4096 -n 4096 -alpha 1 -beta 0 -warm_up -layout 101
                                                                                                                         | <--       CLBlast       --> | <--       clBLAS        --> |
        m;        n;   layout;   transA;      lda;     incx;     incy;     offa;     offx;     offy;    alpha;     beta;     ms_1; GFLOPS_1;    GBs_1;     ms_2; GFLOPS_2;    GBs_2;  
       4K;       4K;      101;      111;       4K;        1;        1;        0;        0;        0; 1.000000; 0.000000;    20.28;      1.7;      3.3;     5.87;      5.7;     11.4;    <--- old
       4K;       4K;      101;      111;       4K;        1;        1;        0;        0;        0; 1.000000; 0.000000;     4.77;      7.0;     14.1;     5.79;      5.8;     11.6;    <--- new

The new kernel can already be found in the gemv_performance branch. However, I'll need to make some changes to the tuning database (preferably re-tune for all devices) since the kernel has changed significantly. I will also try to run it on a GTX 480 or similar to verify that performance has improved.
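
The tiling idea described above can be sketched on the CPU as follows. This is a plain Python emulation under stated assumptions, not the actual OpenCL kernel from the gemv_performance branch: the function name, the tile size, and the loop structure are all hypothetical, and the `local` list only stands in for local memory.

```python
def gemv_tiled(a, x, m, n, tile=4):
    """y = A*x for a row-major flat list `a` of size m*n (emulation only)."""
    y = [0.0] * m
    for row0 in range(0, m, tile):        # one "work-group" per block of rows
        rows = min(tile, m - row0)
        acc = [0.0] * rows
        for col0 in range(0, n, tile):    # walk along the rows tile by tile
            cols = min(tile, n - col0)
            # Cooperative load: consecutive "work-items" (index c) read
            # consecutive addresses within a row, which is what would
            # allow coalescing on a GPU.
            local = [[a[(row0 + r) * n + col0 + c] for c in range(cols)]
                     for r in range(rows)]
            # Compute: "work-item" r reads its row from the local tile.
            for r in range(rows):
                for c in range(cols):
                    acc[r] += local[r][c] * x[col0 + c]
        for r in range(rows):
            y[row0 + r] = acc[r]
    return y

# Small check against a straightforward row-by-row product.
m, n = 5, 7
a = [float(i % 11) for i in range(m * n)]
x = [float(j + 1) for j in range(n)]
reference = [sum(a[i * n + j] * x[j] for j in range(n)) for i in range(m)]
assert gemv_tiled(a, x, m, n) == reference
```

The point of the split into a load phase and a compute phase is that the access pattern to global memory no longer has to match the pattern in which each work-item consumes the data.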

When this is ready for the main branch, please post a visible reminder, and I'll retune CLBlast for the devices I have.

This is now merged into the development branch. @JanAbbing Could you re-run your experiment and verify whether this issue is fixed? If it is not fast right away with the default settings, could you re-tune it for your GPU and upload the corresponding JSON files here? Thanks!

It improved quite a bit, to 2.17 ms.

Thanks for the help!