🌤🌤 CONTRIBUTE 🎉🎉
DefTruth opened this issue · 1 comment
DefTruth commented
🌤🌤 Goal
First of all, any kernel implementation is welcome. This repo is mainly for learning and practice; peak performance is not its ultimate goal — first get things working, then make them fast. For the best performance, use official implementations such as cuBLAS, cuDNN, FlashAttention, and TensorRT directly. If there is a kernel you would like to see implemented in this repo, feel free to comment on this issue (although I may not be able to implement it 🌚), for example:
☕️☕️ Kernel Trace
- xxx kernel
- ...
👨‍💻👨‍💻 Code Guidelines
Submitted code should follow these guidelines (a minimal style sketch follows the list):
- Each operator/kernel lives in its own directory; see relu, gelu, etc. for reference
- Follow any existing kernel as a template and verify correctness against torch
- This repo uses 2-space indentation
- Braces { } use the attached (same-line) style rather than the vertically aligned style
- #pragma unroll is aligned with the for loop it applies to
- Keep lines under 100 characters where possible
- Delete unused or not-yet-ready code
- More rules may be added as they come to mind 🌚......
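
For illustration only, here is a minimal sketch laid out under these rules (2-space indent, attached braces, `#pragma unroll` aligned with its loop, lines under 100 characters). The kernel and names are placeholders, not actual files in this repo, and a plain standalone `main` is used instead of the repo's torch-based checks just to keep the sketch self-contained:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Style sketch: 2-space indentation, attached braces, #pragma unroll aligned with its for loop.
// Each thread handles 4 consecutive elements of an elementwise ReLU.
__global__ void relu_f32_kernel(const float* x, float* y, int N) {
  int base = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
  #pragma unroll
  for (int i = 0; i < 4; ++i) {
    int idx = base + i;
    if (idx < N) {
      y[idx] = fmaxf(x[idx], 0.0f);
    }
  }
}

int main() {
  const int N = 1024;
  float *x, *y;
  cudaMallocManaged(&x, N * sizeof(float));
  cudaMallocManaged(&y, N * sizeof(float));
  for (int i = 0; i < N; ++i) {
    x[i] = (i % 2 == 0) ? -1.0f : 1.0f;  // alternate negative/positive inputs
  }
  int threads = 256;
  int blocks = (N / 4 + threads - 1) / threads;
  relu_f32_kernel<<<blocks, threads>>>(x, y, N);
  cudaDeviceSynchronize();
  printf("y[0]=%f y[1]=%f\n", y[0], y[1]);  // expect 0.000000 and 1.000000
  cudaFree(x);
  cudaFree(y);
  return 0;
}
```

Something like `nvcc relu_sketch.cu -o relu_sketch` (file name hypothetical) builds and runs it end to end.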
🎉🎉 Acknowledgements
Thanks to @bear-zd, @wangzijian1010 and others for contributing a large number of kernel implementations to this repo ~
☕️☕️ Kernel Trace
- swish kernel #85 by @wangzijian1010
- gelu kernel #66 by @bear-zd
- RoPE kernel #80 by @bear-zd
- pack elementwise_add (see the packed-access sketch after this list)
- pack sigmoid
- pack relu
- histogram
- warp/block reduce
- softmax
- pack safe_softmax
- pack layer-norm
- pack rms-norm
- flash-attn-1 f32
- flash-attn-2 f16
- flash-attn-3 f8 ada
- MMA (Tensor Cores) flash-attn-2 f16
- warp sgemv
- warp hgemv
- bank conflicts reduce sgemm
- pipelining sgemm
- split_k sgemm
- pack LDST hgemm
- bank conflicts reduce hgemm
- pipelining hgemm
- split_k hgemm
- cp.async hgemm
- cp.async sgemm
- WMMA API (Tensor Cores) hgemm
- stage2/3/4 (Tensor Cores) hgemm
- MMA PTX (Tensor Cores) hgemm
- TF32 WMMA (Tensor Cores) sgemm
- online_safe_softmax f32 #60 by @bear-zd
- pack online_safe_softmax #73 by @bear-zd
- embedding pack/unpack kernel #68 by @bear-zd
- mat transpose kernel #89 by @bear-zd
- Thread Block Swizzle hgemm
- Thread Block Swizzle sgemm TF32
- batchnorm kernel
- nms kernel #102 by @bear-zd
- HGEMM MMA Col Major
- ... (any kernel)
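
On the "pack" items above: pack here means vectorized loads/stores (e.g. float4 for f32, half2 for f16), so each thread moves several elements per memory transaction. Below is a minimal sketch of what a packed elementwise_add might look like, assuming N is a multiple of 4 and the pointers are 16-byte aligned; the kernel name and launch shape are illustrative, not the repo's actual code:

```cuda
#include <cuda_runtime.h>

// Packed (float4) elementwise add: each thread loads and stores 4 floats at a time.
// Sketch only: assumes N is a multiple of 4 and a/b/c are 16-byte aligned.
__global__ void elementwise_add_f32x4_kernel(const float* a, const float* b, float* c, int N) {
  int idx = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
  if (idx < N) {
    float4 va = reinterpret_cast<const float4*>(a + idx)[0];
    float4 vb = reinterpret_cast<const float4*>(b + idx)[0];
    float4 vc;
    vc.x = va.x + vb.x;
    vc.y = va.y + vb.y;
    vc.z = va.z + vb.z;
    vc.w = va.w + vb.w;
    reinterpret_cast<float4*>(c + idx)[0] = vc;
  }
}
```

A launch such as `elementwise_add_f32x4_kernel<<<(N / 4 + 255) / 256, 256>>>(a, b, c, N)` covers all N elements with a quarter of the threads a scalar kernel would need, issuing the same traffic in wider transactions.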
github-actions commented
This issue is stale because it has been open for 30 days with no activity.
