Load with ldmatrix
liyanc opened this issue · 3 comments
Hello,
I'm curious whether the implementation uses the ldmatrix instruction for loading tiles from shared memory to registers. It seems the current version doesn't implement load() with explicit ldmatrix, per https://github.com/HazyResearch/ThunderKittens/blob/a562ed2569c45b0ffea844688594158cb7c6e858/src/ops/group/memory/tile/shared_to_register.cuh#L27.
Will nvcc compile the function to ldmatrix, or do the authors intend to add ldmatrix in a future step?
Neither? We didn't really think it was worth dealing explicitly with the shared memory layout implied by ldmatrix/stmatrix, and doing it directly with swizzling seemed fast enough. So at the moment, there are no plans to add it. But if we're wrong on this point and it would meaningfully unlock more performance, we could be persuaded.
ldmatrix with the .x4 qualifier can load a 16x16 tile of 16-bit elements (four 8x8 matrices) with a single instruction, whereas LDS.32 needs four instructions per thread for the same tile. ldmatrix also offers a transposed load through the .trans qualifier.
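To make the comparison concrete, here is a minimal host-side sketch in plain C++ (not device code, and not ThunderKittens code) of the per-lane layout that the PTX documentation describes for ldmatrix m8n8 .x4 with 16-bit elements. The names Coord, row_address_for_lane, fragment_for_lane and covers_tile_exactly_once are hypothetical. The placement of the four 8x8 sub-matrices inside the 16x16 tile is an assumption: on real hardware the caller chooses it through the addresses each lane supplies.

```cpp
#include <array>
#include <cassert>

// Hypothetical helper type: a (row, col) position inside one 16x16 tile
// of 16-bit elements held in shared memory.
struct Coord {
    int row;
    int col;
};

// ASSUMED placement of the four 8x8 sub-matrices within the tile:
//   matrix 0 = rows 0-7,  cols 0-7     matrix 2 = rows 0-7,  cols 8-15
//   matrix 1 = rows 8-15, cols 0-7     matrix 3 = rows 8-15, cols 8-15
// Lanes 0-7 supply the row addresses of matrix 0, lanes 8-15 of matrix 1,
// lanes 16-23 of matrix 2, and lanes 24-31 of matrix 3.
Coord row_address_for_lane(int lane) {
    assert(lane >= 0 && lane < 32);
    return Coord{lane % 16, (lane / 16) * 8};
}

// The two 16-bit elements that end up packed in 32-bit register `reg`
// (0..3) of `lane` after the load. Without .trans, four consecutive lanes
// share one row of an 8x8 matrix; with .trans the roles of row and column
// inside each 8x8 matrix are swapped.
std::array<Coord, 2> fragment_for_lane(int lane, int reg, bool trans) {
    assert(lane >= 0 && lane < 32);
    assert(reg >= 0 && reg < 4);
    const int base_row = (reg % 2) * 8;  // top-left corner of sub-matrix `reg`
    const int base_col = (reg / 2) * 8;
    const int major = lane / 4;          // 0..7
    const int minor = (lane % 4) * 2;    // 0, 2, 4, 6
    if (!trans) {
        return {{Coord{base_row + major, base_col + minor},
                 Coord{base_row + major, base_col + minor + 1}}};
    }
    return {{Coord{base_row + minor, base_col + major},
             Coord{base_row + minor + 1, base_col + major}}};
}

// Checks that 32 lanes x 4 registers x 2 elements touch each of the
// 256 tile elements exactly once, i.e. one .x4 load covers the whole tile.
bool covers_tile_exactly_once(bool trans) {
    int hits[16][16] = {};
    for (int lane = 0; lane < 32; ++lane) {
        for (int reg = 0; reg < 4; ++reg) {
            for (const Coord& c : fragment_for_lane(lane, reg, trans)) {
                ++hits[c.row][c.col];
            }
        }
    }
    for (const auto& row : hits) {
        for (int h : row) {
            if (h != 1) return false;
        }
    }
    return true;
}
```

In this model each lane ends up with four 32-bit registers, which is the same amount of data it would otherwise fetch with four separate 32-bit shared loads. The catch raised in the reply above still applies: each lane must supply the address of a full 16-byte row, so the shared memory layout and any swizzling have to be compatible with that.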