Pinned Repositories
AutoAWQ
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
AutoFP8
AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
cub
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
Cute_exercise
Cute_exercise
cute_gemm
cutlass_flash_atten_fp8
An FP8 flash attention implementation for the Ada architecture, built with the CUTLASS library.
flash-attention
Fast and memory-efficient exact attention
MoE
MoE layer for PyTorch
tiny-flash-attention
A stripped-down flash-attention implementation built with CUTLASS, intended for teaching purposes.
weishengying's Repositories
weishengying/cutlass_flash_atten_fp8
An FP8 flash attention implementation for the Ada architecture, built with the CUTLASS library.
weishengying/tiny-flash-attention
A stripped-down flash-attention implementation built with CUTLASS, intended for teaching purposes.
weishengying/cute_gemm
weishengying/Cute_exercise
Cute_exercise
weishengying/AutoAWQ
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
weishengying/MoE
MoE layer for PyTorch
weishengying/AutoFP8
weishengying/AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
weishengying/cub
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
weishengying/CUDA-Learn-Notes
🎉 Modern CUDA Learn Notes with PyTorch: fp32, fp16, bf16, fp8/int8, flash_attn, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.
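One of the topics listed above is the warp/block reduce. As a minimal illustrative sketch (not code from the CUDA-Learn-Notes repository; the kernel and function names here are my own), a warp-level sum built on shuffle intrinsics looks like this:

#include <cstdio>
#include <cuda_runtime.h>

// Sum a value across the 32 lanes of a warp using shuffle intrinsics.
__device__ float warp_reduce_sum(float val) {
    // Each step folds the upper half of the active lanes into the lower half.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the warp-wide sum
}

// A single warp sums a tiny array, purely for demonstration.
__global__ void warp_sum_kernel(const float* x, float* out, int n) {
    float v = (threadIdx.x < n) ? x[threadIdx.x] : 0.0f;
    v = warp_reduce_sum(v);
    if (threadIdx.x == 0) *out = v;
}

int main() {
    float h[32];
    for (int i = 0; i < 32; ++i) h[i] = 1.0f;  // expected sum: 32
    float *d_x, *d_out, h_out;
    cudaMalloc(&d_x, sizeof(h));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_x, h, sizeof(h), cudaMemcpyHostToDevice);
    warp_sum_kernel<<<1, 32>>>(d_x, d_out, 32);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", h_out);  // prints 32.000000
    cudaFree(d_x); cudaFree(d_out);
    return 0;
}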
weishengying/cutlass-kernels
weishengying/FasterTransformer
Transformer-related optimizations, including BERT and GPT
weishengying/flash-attention
Fast and memory-efficient exact attention
weishengying/How_to_optimize_in_GPU
A series of GPU optimization topics that introduces, in detail, how to optimize CUDA kernels. It covers several basic kernel optimizations, including elementwise, reduce, sgemv, and sgemm; the performance of these kernels is at or near the theoretical limit.
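To give a concrete sense of the simplest kernel in that list, below is a minimal grid-stride elementwise-add kernel. This is an illustrative sketch of my own, not code taken from How_to_optimize_in_GPU, and it carries none of the repository's tuning:

#include <cstdio>
#include <cuda_runtime.h>

// Elementwise add with a grid-stride loop so any launch size covers any n.
__global__ void add_kernel(const float* a, const float* b, float* c, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;                 // unified memory keeps the demo short
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
    add_kernel<<<256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
    printf("c[0] = %f, c[n-1] = %f\n", c[0], c[n - 1]);  // expect 3.0 for both
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}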
weishengying/lectures
Material for cuda-mode lectures
weishengying/Notes
Miscellaneous notes and jottings
weishengying/NVIDIA_SGEMM_PRACTICE
Step-by-step optimization of CUDA SGEMM
weishengying/OmniQuant
[ICLR 2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
weishengying/smoothquant
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models