tpoisonooo/how-to-optimize-gemm

How to overlap the share2register loads and the computing process?

YijiaZhao opened this issue · 6 comments

I have another question about MMult_cuda_12.cu.
Honestly, I don't understand how the share2register loads and the computation overlap. Is it the asm (PTX) that makes them run in parallel? The instructions are issued sequentially, so how can these two parts of the code hide each other's latency?
part1: loading from shared memory into the register panels
lds128(panelA[pp][0], panelA[pp][1], panelA[pp][2], panelA[pp][3],
       aptr_base + ((subk + 1) % 8) * SMEM_LDA * sizeof(float));
lds128(panelA[pp][4], panelA[pp][5], panelA[pp][6], panelA[pp][7],
       aptr_base + (((subk + 1) % 8) * SMEM_LDA + 64) * sizeof(float));
lds128(panelB[pp][0], panelB[pp][1], panelB[pp][2], panelB[pp][3],
       bptr_base + ((subk + 1) % 8) * SMEM_LDB * sizeof(float));
lds128(panelB[pp][4], panelB[pp][5], panelB[pp][6], panelB[pp][7],
       bptr_base + (((subk + 1) % 8) * SMEM_LDB + 64) * sizeof(float));

part2: computing on the panel data
#pragma unroll
for (int i = 0; i < 8; ++i) {
#pragma unroll
  for (int j = 0; j < 8; ++j) {
    sum[i][j] += panelA[subk % 2][i] * panelB[subk % 2][j];
  }
}
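For orientation, here is a sketch of how these two parts sit together in the inner loop. This is a reconstruction from the snippets above, not a verbatim copy of MMult_cuda_12.cu: lds128 is the repo's LDS.128 wrapper, and pp, subk, aptr_base/bptr_base, SMEM_LDA/SMEM_LDB follow the code above; the wrap-around of the last iteration (where (subk + 1) % 8 becomes 0) is simplified.

// Reconstructed inner-loop structure (a sketch, not the exact repo code).
// pp selects the register buffer being filled for step subk + 1, while the
// FMAs read panelA/panelB[subk % 2], the buffer filled one step earlier.
#pragma unroll
for (int subk = 0; subk < 8; ++subk) {
  const int pp = (subk + 1) % 2;
  // part1: issue shared->register loads for step subk + 1; their destination
  // registers are not read anywhere in this iteration, so no stall occurs.
  lds128(panelA[pp][0], panelA[pp][1], panelA[pp][2], panelA[pp][3],
         aptr_base + ((subk + 1) % 8) * SMEM_LDA * sizeof(float));
  lds128(panelA[pp][4], panelA[pp][5], panelA[pp][6], panelA[pp][7],
         aptr_base + (((subk + 1) % 8) * SMEM_LDA + 64) * sizeof(float));
  lds128(panelB[pp][0], panelB[pp][1], panelB[pp][2], panelB[pp][3],
         bptr_base + ((subk + 1) % 8) * SMEM_LDB * sizeof(float));
  lds128(panelB[pp][4], panelB[pp][5], panelB[pp][6], panelB[pp][7],
         bptr_base + (((subk + 1) % 8) * SMEM_LDB + 64) * sizeof(float));
  // part2: FMAs for step subk; they depend only on the other buffer, so the
  // hardware can execute them while the loads above are still in flight.
#pragma unroll
  for (int i = 0; i < 8; ++i) {
#pragma unroll
    for (int j = 0; j < 8; ++j) {
      sum[i][j] += panelA[subk % 2][i] * panelB[subk % 2][j];
    }
  }
}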

This gets a bit convoluted, so please bear with me.

The prerequisite for ping-pong is having two mutually independent agents. In computer architecture, the ALU does the arithmetic while the load/store hardware moves the data; those are two independent agents, so one can compute while the other moves data.

Concretely, issuing a command takes only 1 cycle, while the data movement it triggers takes 100 cycles.

A more vivid example: you have a helper. You order the helper to go sell the ice while you stay behind and make it. In pseudocode, your job is:

make_ice(0)                 // 100 cycles
sell_ice(helper, ptr_ice0)  // 1 cycle to issue the command
make_ice(1)                 // 100 cycles
sell_ice(helper, ptr_ice1)  // 1 cycle to issue the command

The helper's job:

recv_sell_ice_cmd(ptr_ice0)
do_sell_ice(ptr_ice0)  // 100 cycles

recv_sell_ice_cmd(ptr_ice1)
do_sell_ice(ptr_ice1)  // 100 cycles

Now we have parallelism: the whole task finishes after 302 (202 + 100) cycles, yet the two agents together performed 402 cycles' worth of work.

Back to the original question: part1 and part2 are sequential in the code, but they are executed by different hardware units.

(By "ice" I mean 冰粉, bingfen, a Chengdu specialty.)
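The same mechanism can be shown in a self-contained toy kernel (a hypothetical illustration of the pattern, not the repo's code; pipelined_sum, buf, and acc are made-up names). The load for iteration k + 1 writes buf[(k + 1) & 1], while the math for iteration k reads only buf[k & 1]; since there is no register dependence between them, the hardware lets the shared-memory load run in parallel with the FMA and only stalls at the first actual read of the loaded register, one iteration later.

#include <cstdio>

// Toy kernel (hypothetical, for illustration only): double-buffered
// shared->register loads overlapped with math, the same part1/part2
// pattern as above but small enough to read in one screen.
__global__ void pipelined_sum(const float *in, float *out, int k_iters) {
  __shared__ float smem[256];
  smem[threadIdx.x] = in[threadIdx.x];
  __syncthreads();

  float buf[2];
  float acc = 0.0f;
  buf[0] = smem[0];  // prefetch for iteration 0
  for (int k = 0; k < k_iters; ++k) {
    if (k + 1 < k_iters) {
      // "part1": issue the load for iteration k + 1; nothing reads
      // buf[(k + 1) & 1] until the next iteration, so no stall here.
      buf[(k + 1) & 1] = smem[(k + 1) % 256];
    }
    // "part2": math for iteration k, independent of the load above.
    acc += buf[k & 1] * buf[k & 1];
  }
  out[threadIdx.x] = acc;
}

int main() {
  float h_in[256], h_out[256];
  for (int i = 0; i < 256; ++i) h_in[i] = 1.0f;
  float *d_in, *d_out;
  cudaMalloc(&d_in, sizeof(h_in));
  cudaMalloc(&d_out, sizeof(h_out));
  cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
  pipelined_sum<<<1, 256>>>(d_in, d_out, 256);
  cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
  printf("acc[0] = %f\n", h_out[0]);  // expect 256.0 with all-ones input
  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}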

Thank you for your reply. There is no sync between part1 and part2, so I thought they ran sequentially. I asked my colleague, and he said that part1 and part2 run in parallel in the hardware, and that register dependency tracking ensures the s2r load has finished before the computation reads it. His explanation matches yours.

I asked him about the CUTLASS code, which has the same pipeline as yours. I would also like to know why you use PTX; what is the advantage of the asm code?

Inline PTX on CUDA is not as powerful as __asm__ on the CPU.

You can just use plain CUDA C; the GFLOPS should be the same.
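For reference, here is a plain-C sketch of what the two panelA lds128 calls in part1 become without inline PTX (assuming a hypothetical a_smem, a float * to the same shared-memory tile that aptr_base addresses in bytes). With 16-byte-aligned addresses, nvcc normally compiles an aligned float4 read from shared memory into the same LDS.128 instruction that the PTX requests explicitly.

// Plain CUDA C equivalent of the two panelA lds128 calls (a sketch; a_smem
// is a hypothetical float* to the shared tile behind aptr_base). Each
// aligned float4 read typically becomes one LDS.128 instruction.
const float4 *a_vec = reinterpret_cast<const float4 *>(
    a_smem + ((subk + 1) % 8) * SMEM_LDA);
float4 a_lo = a_vec[0];   // floats 0..3  -> panelA[pp][0..3]
float4 a_hi = a_vec[16];  // floats 64..67 -> panelA[pp][4..7]
panelA[pp][0] = a_lo.x; panelA[pp][1] = a_lo.y;
panelA[pp][2] = a_lo.z; panelA[pp][3] = a_lo.w;
panelA[pp][4] = a_hi.x; panelA[pp][5] = a_hi.y;
panelA[pp][6] = a_hi.z; panelA[pp][7] = a_hi.w;

The double-buffering logic is untouched; only the load syntax changes, which is why the performance should come out the same.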