DefTruth/CUDA-Learn-Notes

🌤🌤 CONTRIBUTE 🎉🎉

DefTruth opened this issue

🌤🌤 Goals

First of all, any kernel implementation is welcome. This repo is mainly for learning and practice; peak performance is not its ultimate goal. Get things working first, then make them fast. For the best performance, use official implementations such as cuBLAS, cuDNN, FlashAttention, and TensorRT directly. If there is a kernel you would like to see implemented in this repo, feel free to comment on this issue (though I may not be able to implement it 🌚), for example:

☕️☕️ Kernel Trace

  • xxx kernel
  • ...

👨‍💻👨‍💻 Code Style

Submitted code should follow these conventions (a minimal style sketch follows this list):

  • Each operator/kernel gets its own directory; see relu, gelu, etc. for reference
  • Following any existing kernel as a template, verify correctness against torch
  • This repo uses 2-space indentation
  • Braces { } use the attached style (opening brace on the same line, not vertically aligned)
  • #pragma unroll is indented to match the for loop it applies to
  • Keep lines under 100 characters where possible
  • Delete unused or not-yet-ready code
  • More rules as I think of them 🌚......
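
To make the indentation, brace, and #pragma unroll rules concrete, here is a minimal sketch. The kernel name, signature, and vector width are hypothetical illustrations, not an existing kernel in this repo:

```cuda
// Hypothetical example kernel (illustration only, not a repo kernel) showing the
// conventions above: 2-space indents, attached braces, #pragma unroll indented
// to match its for loop, and lines kept under 100 characters.
#include <cuda_runtime.h>

__global__ void elementwise_add_f32x4_kernel(const float* a, const float* b,
                                             float* c, int N) {
  // Each thread handles 4 consecutive elements; the fixed trip count lets
  // the compiler fully unroll the loop.
  int idx = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
  #pragma unroll
  for (int i = 0; i < 4; ++i) {
    if (idx + i < N) {
      c[idx + i] = a[idx + i] + b[idx + i];
    }
  }
}
```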

🎉🎉 Acknowledgements

Thanks to @bear-zd, @wangzijian1010, and others for contributing a large number of kernel implementations to this repo ~

☕️☕️ Kernel Trace

  • swish kernel #85 by @wangzijian1010
  • gelu kernel #66 by @bear-zd
  • RoPE kernel #80 by @bear-zd
  • pack elementwise_add
  • pack sigmoid
  • pack relu
  • histogram
  • warp/block reduce
  • softmax
  • pack safe_softmax
  • pack layer-norm
  • pack rms-norm
  • flash-attn-1 f32
  • flash-attn-2 f16
  • flash-attn-3 f8 ada
  • MMA(Tensor Cores) flash-attn-2 f16
  • warp sgemv
  • warp hgemv
  • bank conflicts reduce sgemm
  • pipelining sgemm
  • split_k sgemm
  • pack LDST hgemm
  • bank conflicts reduce hgemm
  • pipelining hgemm
  • split_k hgemm
  • cp.async hgemm
  • cp.async sgemm
  • WMMA API(Tensor Cores) hgemm
  • stage2/3/4 (Tensor Cores) hgemm
  • MMA PTX(Tensor Cores) hgemm
  • TF32 WMMA(Tensor Cores) sgemm
  • online_safe_softmax f32 #60 by @bear-zd
  • pack online_safe_softmax #73 by @bear-zd
  • embedding pack/unpack kernel #68 by @bear-zd
  • mat transpose kernel #89 by @bear-zd
  • Thread Block Swizzle hgemm
  • Thread Block Swizzle sgemm TF32
  • batchnorm kernel
  • nms kernel #102 by @bear-zd
  • HGEMM MMA Col Major
  • ... (any kernel)
