tpoisonooo/how-to-optimize-gemm

CUDA version gives wrong results when m, n, and k are not all equal

seth-lu opened this issue · 1 comment

For example, in kernel_v3:
float *begin_a = a + by * BLOCK * k; //by->n
float *begin_b = b + bx * BLOCK; //bx->m

This goes wrong when A and B are not square matrices, for example m=k=256, n=128.
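To make the failure mode concrete, here is a minimal host-side sketch (plain C++, no GPU needed) of tile offsets computed from explicit leading dimensions. It is not the repository's kernel: the row-major layout, the mapping of `by` to rows (m) and `bx` to columns (n), and the names `gemm_ref`, `gemm_tiled`, and `max_abs_diff` are all assumptions made for illustration. The point is only that each tile offset uses the stride of the matrix it indexes, so a non-square shape such as m=k=256, n=128 should match a naive reference.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

constexpr int BLOCK = 16;  // hypothetical tile size; m and n must be multiples of it

// Naive reference: C (m x n) = A (m x k) * B (k x n), all row-major (assumed layout).
void gemm_ref(int m, int n, int k, const float* a, const float* b, float* c) {
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j) {
      float sum = 0.f;
      for (int p = 0; p < k; ++p) sum += a[i * k + p] * b[p * n + j];
      c[i * n + j] = sum;
    }
}

// Host-side emulation of a BLOCK x BLOCK tiled kernel.
// bx/by stand in for blockIdx.x/blockIdx.y, tx/ty for threadIdx.x/threadIdx.y.
// Assumption: by walks rows of C (dimension m), bx walks columns of C (dimension n).
void gemm_tiled(int m, int n, int k, const float* a, int lda,
                const float* b, int ldb, float* c, int ldc) {
  for (int by = 0; by < m / BLOCK; ++by)
    for (int bx = 0; bx < n / BLOCK; ++bx) {
      // Each offset uses the leading dimension of the matrix it indexes.
      const float* begin_a = a + by * BLOCK * lda;      // row strip of A
      const float* begin_b = b + bx * BLOCK;            // column strip of B
      float* begin_c = c + by * BLOCK * ldc + bx * BLOCK;
      for (int ty = 0; ty < BLOCK; ++ty)
        for (int tx = 0; tx < BLOCK; ++tx) {
          float sum = 0.f;
          for (int p = 0; p < k; ++p)
            sum += begin_a[ty * lda + p] * begin_b[p * ldb + tx];
          begin_c[ty * ldc + tx] = sum;
        }
    }
}

// Largest absolute difference between the tiled and reference results for one shape.
float max_abs_diff(int m, int n, int k) {
  std::vector<float> a(m * k), b(k * n), c_ref(m * n), c_tiled(m * n);
  for (int i = 0; i < m * k; ++i) a[i] = (i % 7) * 0.25f;
  for (int i = 0; i < k * n; ++i) b[i] = (i % 5) * 0.25f;
  gemm_ref(m, n, k, a.data(), b.data(), c_ref.data());
  // Row-major leading dimensions: lda = k, ldb = n, ldc = n.
  gemm_tiled(m, n, k, a.data(), k, b.data(), n, c_tiled.data(), n);
  float diff = 0.f;
  for (int i = 0; i < m * n; ++i)
    diff = std::max(diff, std::fabs(c_ref[i] - c_tiled[i]));
  return diff;
}
```

Calling `max_abs_diff(256, 128, 256)` exercises the shape from this report. With square inputs, a mixed-up stride (for example using `k` where `n` is needed) goes unnoticed because all three values coincide, so non-square shapes are the useful test cases.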

Version 3 is fixed. I got lost in the ugly leading dimensions.

Still, I need to add more CI and unit tests.