DefTruth/CUDA-Learn-Notes
📚150+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
Cuda · GPL-3.0
Pinned issues
Issues
- 3
Include SageAttention Kernel
#147 opened by jason-huang03 - 1
🌤🌤 CONTRIBUTE 🎉🎉
#50 opened by DefTruth - 2
Hello, regarding the speed of online safe softmax: there seems to be no noticeable improvement
#153 opened by lzcchl - 2
Question about understanding the softmax implementation; hoping an expert can clarify
#151 opened by lzcchl - 3
Hello, a question about the reduce-related code
#6 opened by Ss-shuang123 - 3
What __threadfence() does
#7 opened by zbt78 - 3
Layer norm implementation
#2 opened by zbt78 - 2
Hello, why doesn't the sigmoid operator here account for exponential overflow?
#4 opened by Phoenix8215 - 2