HazyResearch/ThunderKittens

Load with ldmatrix

liyanc opened this issue · 3 comments

Hello,

I'm curious if the implementation adopts the ldmatrix instruction for loading tiles from shared memory to registers.
It seems the current version doesn't implement load() with an explicit ldmatrix, per https://github.com/HazyResearch/ThunderKittens/blob/a562ed2569c45b0ffea844688594158cb7c6e858/src/ops/group/memory/tile/shared_to_register.cuh#L27.
Will nvcc compile the function to ldmatrix, or do the authors intend to add ldmatrix in a future step?

Neither? We didn't really think it was worth dealing explicitly with the shared memory layout implied by ldmatrix/stmatrix, and doing it directly with swizzling seemed fast enough. So at the moment, there are no plans to add it. But if we're wrong on this point and it would meaningfully unlock more performance, we could be persuaded.

With the .x4 qualifier, ldmatrix loads a 16x16 tile of 16-bit elements with a single warp-wide instruction, whereas LDS.32 needs 4 load instructions per thread for the same tile. ldmatrix also offers a transposed load via the .trans qualifier.
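
To make the comparison concrete, here is a minimal sketch (not ThunderKittens code) that loads one 16x16 half-precision tile with a single ldmatrix.x4 and checks it against four plain 32-bit shared-memory loads per thread. The names `ldmatrix_x4`, `ldmatrix_check_kernel`, and `run_ldmatrix_check` are hypothetical, and the tile is assumed to be row-major and unswizzled, which is not the layout the library uses. It should need `nvcc -arch=sm_75` or newer.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdint>

// Hypothetical helper: one warp-wide ldmatrix.x4. It loads four 8x8 matrices
// of 16-bit elements (together a 16x16 tile) into four 32-bit registers per thread.
__device__ inline void ldmatrix_x4(uint32_t (&r)[4], const void* row_ptr) {
    // ldmatrix takes a shared-state-space address, not a generic pointer.
    uint32_t addr = static_cast<uint32_t>(__cvta_generic_to_shared(row_ptr));
    asm volatile(
        "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];\n"
        : "=r"(r[0]), "=r"(r[1]), "=r"(r[2]), "=r"(r[3])
        : "r"(addr)
        : "memory");
}

// Must be launched with exactly one warp (32 threads).
__global__ void ldmatrix_check_kernel(int* mismatches) {
    // Assumption: plain row-major tile, no swizzling.
    __shared__ __align__(16) __half tile[16][16];
    const int lane = threadIdx.x;

    // tile[r][c] = r * 16 + c; values 0..255 are exact in half precision.
    for (int i = lane; i < 256; i += 32) {
        tile[i / 16][i % 16] = __int2half_rn(i);
    }
    __syncthreads();

    // Each lane supplies the address of one 16-byte row segment:
    // lanes 0-15 point at rows 0-15, columns 0-7;
    // lanes 16-31 point at rows 0-15, columns 8-15.
    uint32_t frag[4];
    ldmatrix_x4(frag, &tile[lane % 16][(lane / 16) * 8]);

    // Reference path: four separate 32-bit loads per thread.
    // Register j comes from 8x8 sub-matrix j; this lane holds two adjacent
    // elements at row lane / 4, columns 2 * (lane % 4) and 2 * (lane % 4) + 1.
    const int row = lane / 4;
    const int col = (lane % 4) * 2;
    const int row0[4] = {0, 8, 0, 8};  // top row of each sub-matrix
    const int col0[4] = {0, 0, 8, 8};  // left column of each sub-matrix
    int bad = 0;
    for (int j = 0; j < 4; ++j) {
        const uint32_t ref = *reinterpret_cast<const uint32_t*>(
            &tile[row0[j] + row][col0[j] + col]);
        bad += (ref != frag[j]);
    }
    atomicAdd(mismatches, bad);
}

// Host-side driver: returns the number of mismatching registers across the
// warp, or -1 if a CUDA call fails.
int run_ldmatrix_check() {
    int* d_bad = nullptr;
    int h_bad = -1;
    if (cudaMalloc(&d_bad, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_bad, 0, sizeof(int));
    ldmatrix_check_kernel<<<1, 32>>>(d_bad);
    if (cudaMemcpy(&h_bad, d_bad, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess) {
        h_bad = -1;
    }
    cudaFree(d_bad);
    return h_bad;
}
```

Calling `run_ldmatrix_check()` from a `main` gives the mismatch count between the two paths. The sketch only covers the instruction-count point. Whether ldmatrix is actually faster than swizzled 32-bit loads depends on bank conflicts and the shared memory layout, which it does not measure.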