Semi-Sparse Matrix Multiplication

Modified from the CUDA Tensor Core GEMM sample.

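Split the matrix into 16*16 tiles. For each tile, record the size of its stored data, and use 32-bit words as a bitmask that marks whether each element is zero (bit set = nonzero). A 16*16 tile has 256 elements, so a complete mask takes eight 32-bit words, each covering two rows. For the tile below, every row encodes as 0xC848 (column 0 in the most significant bit), so every mask word is 0xC848C848: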

0.5, 1, 0, 0, 0.3, 0, 0, 0, 0, 4.0, 0, 0, 5.6, 0, 0, 0
0.5, 1, 0, 0, 0.3, 0, 0, 0, 0, 4.0, 0, 0, 5.6, 0, 0, 0
0.5, 1, 0, 0, 0.3, 0, 0, 0, 0, 4.0, 0, 0, 5.6, 0, 0, 0
0.5, 1, 0, 0, 0.3, 0, 0, 0, 0, 4.0, 0, 0, 5.6, 0, 0, 0
0.5, 1, 0, 0, 0.3, 0, 0, 0, 0, 4.0, 0, 0, 5.6, 0, 0, 0
0.5, 1, 0, 0, 0.3, 0, 0, 0, 0, 4.0, 0, 0, 5.6, 0, 0, 0
0.5, 1, 0, 0, 0.3, 0, 0, 0, 0, 4.0, 0, 0, 5.6, 0, 0, 0
0.5, 1, 0, 0, 0.3, 0, 0, 0, 0, 4.0, 0, 0, 5.6, 0, 0, 0
0.5, 1, 0, 0, 0.3, 0, 0, 0, 0, 4.0, 0, 0, 5.6, 0, 0, 0
0.5, 1, 0, 0, 0.3, 0, 0, 0, 0, 4.0, 0, 0, 5.6, 0, 0, 0
0.5, 1, 0, 0, 0.3, 0, 0, 0, 0, 4.0, 0, 0, 5.6, 0, 0, 0
0.5, 1, 0, 0, 0.3, 0, 0, 0, 0, 4.0, 0, 0, 5.6, 0, 0, 0
0.5, 1, 0, 0, 0.3, 0, 0, 0, 0, 4.0, 0, 0, 5.6, 0, 0, 0
0.5, 1, 0, 0, 0.3, 0, 0, 0, 0, 4.0, 0, 0, 5.6, 0, 0, 0
0.5, 1, 0, 0, 0.3, 0, 0, 0, 0, 4.0, 0, 0, 5.6, 0, 0, 0
0.5, 1, 0, 0, 0.3, 0, 0, 0, 0, 4.0, 0, 0, 5.6, 0, 0, 0
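
As a rough host-side illustration of how such a mask could be built, here is a minimal C++ sketch. The names `Tile`, `TileMask` and `build_mask` are hypothetical and not taken from the project. The bit layout is an assumption inferred from the 0xC848 example: column 0 goes to the most significant bit of its row's 16-bit group, and the even row of each pair goes to the upper half of the word. The mask is sized at eight words because 16*16 elements need 256 bits.

```cpp
#include <array>
#include <cstdint>

constexpr int kTile = 16;                       // tile is kTile x kTile
using Tile = std::array<float, kTile * kTile>;  // dense tile, row-major
using TileMask = std::array<std::uint32_t, 8>;  // 256 bits, two rows per word

// Build the nonzero bitmask of a dense tile.
// Assumed layout: column 0 is the most significant bit of its row's 16-bit
// group, and the even row of each row pair sits in the upper half-word.
TileMask build_mask(const Tile& tile) {
    TileMask mask{};  // all bits start cleared
    for (int r = 0; r < kTile; ++r) {
        for (int c = 0; c < kTile; ++c) {
            if (tile[r * kTile + c] != 0.0f) {
                const int bit = (r % 2 == 0 ? 16 : 0) + (15 - c);
                mask[r / 2] |= std::uint32_t{1} << bit;
            }
        }
    }
    return mask;
}
```

With the example tile above, every one of the eight words comes out as 0xC848C848 under this layout.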

The actual data then only needs to store the nonzero values, in row-major order:
0.5, 1, 0.3, 4.0, 5.6, 0.5, 1, 0.3, 4.0, 5.6, 0.5, 1, 0.3, 4.0, 5.6, 0.5, 1, 0.3, 4.0, 5.6, 0.5, 1, 0.3, 4.0, 5.6, 0.5, 1, 0.3, 4.0, 5.6, 0.5, 1, 0.3, 4.0, 5.6, 0.5, 1, 0.3, 4.0, 5.6, 0.5, 1, 0.3, 4.0, 5.6, 0.5, 1, 0.3, 4.0, 5.6, 0.5, 1, 0.3, 4.0, 5.6, 0.5, 1, 0.3, 4.0, 5.6, 0.5, 1, 0.3, 4.0, 5.6, 0.5, 1, 0.3, 4.0, 5.6, 0.5, 1, 0.3, 4.0, 5.6, 0.5, 1, 0.3, 4.0, 5.6
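
The packing step can be sketched on the host as below. This is only an illustration under assumptions: `pack_nonzeros` is a hypothetical name, the tile is taken to be dense and row-major, and `float` is used for readability even though a Tensor Core kernel would typically store half-precision values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kTile = 16;  // tile is kTile x kTile

// Collect the nonzero values of a dense row-major tile, in row-major order.
// The returned vector's size is the stored size of the tile.
std::vector<float> pack_nonzeros(const std::vector<float>& tile) {
    assert(tile.size() == static_cast<std::size_t>(kTile * kTile));
    std::vector<float> packed;
    for (float v : tile) {
        if (v != 0.0f) {
            packed.push_back(v);
        }
    }
    return packed;
}
```

For the example tile this keeps 5 values per row, so 80 values are stored instead of 256.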

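At compute time, first load both the compressed data and the bitmask into registers, then use the bitmask to reconstruct the original dense tile in shared memory, and finally feed it to the Tensor Cores. Because only the nonzero values and the bitmask are loaded, this saves a large part of the data-loading time.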
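
The reconstruction step can be sketched on the host as follows. This is a plain C++ reference, not the project's kernel: `unpack_tile`, `packed_offset` and the other names are hypothetical, and the mask layout (eight words, column 0 in the most significant bit, even row in the upper half-word) is an assumption inferred from the 0xC848 example. Each element finds its value by counting the set bits that precede it, which is roughly what a GPU thread could do with a population-count instruction.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kTile = 16;
using TileMask = std::array<std::uint32_t, 8>;  // 256 bits, two rows per word

// Portable population count (number of set bits in x).
inline int popcount32(std::uint32_t x) {
    int n = 0;
    while (x != 0) { x &= x - 1; ++n; }
    return n;
}

// Bit position of element (r, c) inside mask[r / 2] (assumed layout).
inline int bit_index(int r, int c) { return (r % 2 == 0 ? 16 : 0) + (15 - c); }

// Index of element (r, c) in the packed array: the number of nonzeros that
// come before it in row-major order.
inline int packed_offset(const TileMask& mask, int r, int c) {
    int offset = 0;
    for (int w = 0; w < r / 2; ++w) offset += popcount32(mask[w]);
    const int bit = bit_index(r, c);
    // Higher bits of the same word are earlier elements; avoid a 32-bit shift.
    const std::uint32_t earlier = (bit == 31) ? 0u : (mask[r / 2] >> (bit + 1));
    return offset + popcount32(earlier);
}

// Rebuild the dense row-major tile; positions with a cleared bit stay zero.
std::array<float, kTile * kTile> unpack_tile(const TileMask& mask,
                                             const std::vector<float>& packed) {
    std::array<float, kTile * kTile> tile{};
    for (int r = 0; r < kTile; ++r) {
        for (int c = 0; c < kTile; ++c) {
            if ((mask[r / 2] >> bit_index(r, c)) & 1u) {
                const int off = packed_offset(mask, r, c);
                assert(static_cast<std::size_t>(off) < packed.size());
                tile[r * kTile + c] = packed[off];
            }
        }
    }
    return tile;
}
```

In a real kernel the result would be written to shared memory rather than returned, and the per-word prefix counts would likely be computed once per tile instead of once per element.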

This sparsity pattern is common in deep learning, for example in activations after ReLU.


Update: still under development. Performance does not look good so far: GEMM is compute-intensive, so reducing data-load time seems to help little.