Bruce-Lee-LY/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.
Cuda · MIT
Issues
- 2
Incorrect results with enable_check 1
#12 opened by cokeshao - 2
About the permute implementation
#8 opened by feiyuvl - 0
Question about a code detail in `wmma_async_stage2.cu`
#9 opened by luliyucoordinate - 0
Why does matrix B need to be transposed?
#10 opened by luliyucoordinate - 0
- 1
About the layout of matrices A and B
#7 opened by feiyuvl - 2
Question about the tiling size
#6 opened by macto94 - 2
Cooperative Async Copies
#5 opened by FabianSchuetze - 1
Question: shared memory bank conflicts
#4 opened by matrix97317 - 3
Change to block of 128 by 256
#3 opened by yupei-ms - 1
#define CHUNK_K 2 // 32 / WMMA_K
#2 opened by lk137095576 - 1
mma_naive produces incorrect results
#1 opened by FdyCN