Performance gemv vs gemm
JanAbbing opened this issue · 5 comments
Hello,
I encountered a weird runtime difference between the gemv and the gemm routine.
When I run both with the input M=4096, N=1, K=4096 on my GTX480, the gemm routine takes 3.04 ms and the gemv routine takes 5.51 ms. I would have expected gemv to be faster than gemm, because it is made for such an input. Could it be that gemv isn't yet optimized for the GTX480, or is it normal that it is slower? With cuBLAS it is the other way around: Sgemv is almost 2 times faster than Sgemm.
I call the gemv routine like this:
./clblast_client_xgemv -m 4096 -n 4096 -alpha 1 -beta 0 -warm_up true -runs 100
Greetings,
Jan
You are right, the GEMV kernel is not particularly fast if the matrix is rotated. It's been a while since I looked at it and I completely forgot about it. But you can see similar results if you look in the doc/performance folder of CLBlast. In fact, there is a GTX480 graph included as well. You can also generate such a graph on your own system with the included tools (see README).
Also on my system with the latest version of CLBlast I see this behaviour:
./clblast_client_xgemv -m 4096 -n 4096 -alpha 1 -beta 0 -warm_up -layout 101
| <-- CLBlast --> | <-- clBLAS --> |
m; n; layout; transA; lda; incx; incy; offa; offx; offy; alpha; beta; ms_1; GFLOPS_1; GBs_1; ms_2; GFLOPS_2; GBs_2;
4K; 4K; 101; 111; 4K; 1; 1; 0; 0; 0; 1.000000; 0.000000; 20.28; 1.7; 3.3; 5.87; 5.7; 11.4;
./clblast_client_xgemv -m 4096 -n 4096 -alpha 1 -beta 0 -warm_up -layout 102
| <-- CLBlast --> | <-- clBLAS --> |
m; n; layout; transA; lda; incx; incy; offa; offx; offy; alpha; beta; ms_1; GFLOPS_1; GBs_1; ms_2; GFLOPS_2; GBs_2;
4K; 4K; 102; 111; 4K; 1; 1; 0; 0; 0; 1.000000; 0.000000; 1.70; 19.7; 39.4; 2.77; 12.1; 24.3;
For now, you can get decent performance again if you rotate the matrix (either use column-major layout or set the transpose option). I'll take a more in-depth look at the kernel soon and try to improve it for rotated matrices. I'll keep you up-to-date.
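To illustrate why the layout can matter so much, here is a minimal sketch of the address arithmetic. It assumes, purely for illustration, a kernel that maps one work-item to each output row and lets all work-items read the same column at each step; this mapping is an assumption, not necessarily what the CLBlast kernel does. The layout codes 101 (row-major) and 102 (column-major) are the ones passed to `-layout` above; `flat_index` and `neighbour_stride` are hypothetical helper names.

```python
def flat_index(row, col, ld, layout):
    """Offset of A[row][col] in a flat buffer with leading dimension ld."""
    if layout == 101:          # row-major: each row is contiguous
        return row * ld + col
    if layout == 102:          # column-major: each column is contiguous
        return col * ld + row
    raise ValueError("unknown layout code")


def neighbour_stride(ld, layout, col=0):
    """Distance (in elements) between the addresses read by the
    work-items handling rows 0 and 1 when both read column `col`.
    Assumes one work-item per output row (illustrative only)."""
    return flat_index(1, col, ld, layout) - flat_index(0, col, ld, layout)


if __name__ == "__main__":
    for layout in (101, 102):
        print(layout, neighbour_stride(4096, layout))
```

Under this assumed mapping, neighbouring work-items are a full row apart in memory for layout 101 and directly adjacent for layout 102, which is consistent with the column-major run being the fast one.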
I've designed a new kernel for the rotated case. It has much better data locality since it now loads a tile of matrix A into the local memory. This also enables coalescing. On my device this already improves performance to the clBLAS level (old and new experiments below each other for comparison):
./clblast_client_xgemv -m 4096 -n 4096 -alpha 1 -beta 0 -warm_up -layout 101
| <-- CLBlast --> | <-- clBLAS --> |
m; n; layout; transA; lda; incx; incy; offa; offx; offy; alpha; beta; ms_1; GFLOPS_1; GBs_1; ms_2; GFLOPS_2; GBs_2;
4K; 4K; 101; 111; 4K; 1; 1; 0; 0; 0; 1.000000; 0.000000; 20.28; 1.7; 3.3; 5.87; 5.7; 11.4; <--- old
4K; 4K; 101; 111; 4K; 1; 1; 0; 0; 0; 1.000000; 0.000000; 4.77; 7.0; 14.1; 5.79; 5.8; 11.6; <--- new
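For reading the GFLOPS and GB/s columns, the following sketch reproduces them approximately from the measured time. The traffic model (A and x read once, y read and written once, 4-byte floats) is an assumption made here, and `gemv_metrics` is a hypothetical helper, not part of the CLBlast client.

```python
def gemv_metrics(m, n, ms, bytes_per_element=4):
    """Return (GFLOPS, GB/s) for a GEMV on an m-by-n matrix that took `ms` milliseconds."""
    seconds = ms * 1e-3
    flops = 2.0 * m * n                    # one multiply and one add per matrix element
    # Assumed memory traffic: A once, x once, y read and written once.
    bytes_moved = (m * n + n + 2 * m) * bytes_per_element
    return flops / seconds / 1e9, bytes_moved / seconds / 1e9


if __name__ == "__main__":
    old_ms, new_ms = 20.28, 4.77           # CLBlast timings from the two rows above
    print("old:", gemv_metrics(4096, 4096, old_ms))
    print("new:", gemv_metrics(4096, 4096, new_ms))
    print("speedup:", old_ms / new_ms)
```

With the timings above this gives a speedup of roughly 4.25x for the new kernel, and the computed rates agree with the printed columns to within rounding.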
The new kernel can already be found in the gemv_performance branch. However, I'll need to make some changes to the tuning database (preferably re-tune for all devices) since the kernel has changed significantly. I will also try to run it on a GTX 480 or similar to verify that performance has improved.
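The tiling idea can be sketched on the host as follows. This is only a sequential Python model of the approach described above (stage a tile of A, then reduce from the staged copy), not the actual OpenCL kernel; the function name, the `tile` parameter, and the loop structure are assumptions for illustration.

```python
def gemv_row_major_tiled(a, x, m, n, tile=4):
    """Compute y = A*x for a row-major flat buffer `a` (m rows, n columns),
    processing the matrix in tile-by-tile blocks."""
    y = [0.0] * m
    for row0 in range(0, m, tile):             # models one work-group per block of rows
        rows = min(tile, m - row0)
        acc = [0.0] * rows                     # one accumulator per row in the block
        for col0 in range(0, n, tile):
            cols = min(tile, n - col0)
            # Staging step: the tile is read along rows, i.e. consecutive
            # addresses in row-major storage (the coalesced direction on a GPU).
            local = [[a[(row0 + r) * n + (col0 + c)] for c in range(cols)]
                     for r in range(rows)]
            # Reduction step: each row accumulates its partial dot product
            # from the staged tile instead of from the full matrix.
            for r in range(rows):
                for c in range(cols):
                    acc[r] += local[r][c] * x[col0 + c]
        for r in range(rows):
            y[row0 + r] = acc[r]
    return y
```

On a real device the staging list would correspond to a local-memory buffer filled cooperatively by the work-group, with a barrier between the load and the reduction; the tile size is the kind of parameter that would need re-tuning per device.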
When this is ready for the main branch, please post a visible reminder, and I'll re-tune CLBlast for the devices I have.
This is now merged into the development branch. @JanAbbing Could you re-run your experiment and verify whether this issue is fixed? If it is not fast right away with the default settings, could you re-tune it for your GPU and upload the corresponding JSON files here? Thanks!
It improved quite a bit, to 2.17 ms.
Thanks for the help!