Pinned Repositories
AutoAWQ
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
AutoFP8
AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
cub
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
Cute_exercise
Cute_exercise
cute_gemm
cutlass_flash_atten_fp8
An FP8 flash attention implementation for the Ada architecture, built with the CUTLASS library.
flash-attention
Fast and memory-efficient exact attention
MoE
MoE layer for PyTorch
tiny-flash-attention
A stripped-down flash-attention implementation built with CUTLASS, intended for teaching purposes.
weishengying's Repositories
weishengying/cutlass_flash_atten_fp8
An FP8 flash attention implementation for the Ada architecture, built with the CUTLASS library.
weishengying/tiny-flash-attention
A stripped-down flash-attention implementation built with CUTLASS, intended for teaching purposes.
weishengying/cute_gemm
weishengying/Cute_exercise
Cute_exercise
weishengying/AutoAWQ
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
weishengying/MoE
MoE layer for PyTorch
weishengying/AutoFP8
weishengying/AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
weishengying/cub
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
weishengying/CUDA-Learn-Notes
🎉 Modern CUDA Learn Notes with PyTorch: fp32, fp16, bf16, fp8/int8, flash_attn, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.
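One of the topics listed above is the warp/block reduce. As a minimal illustrative sketch (not code from the CUDA-Learn-Notes repository; the kernel and function names here are my own), a warp-level sum built on shuffle intrinsics looks like this:

#include <cstdio>
#include <cuda_runtime.h>

// Sum a value across the 32 lanes of a warp using shuffle intrinsics.
__device__ float warp_reduce_sum(float val) {
    // Each step folds the upper half of the active lanes into the lower half.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the warp-wide sum
}

// A single warp sums a tiny array, purely for demonstration.
__global__ void warp_sum_kernel(const float* x, float* out, int n) {
    float v = (threadIdx.x < n) ? x[threadIdx.x] : 0.0f;
    v = warp_reduce_sum(v);
    if (threadIdx.x == 0) *out = v;
}

int main() {
    float h[32];
    for (int i = 0; i < 32; ++i) h[i] = 1.0f;  // expected sum: 32
    float *d_x, *d_out, h_out;
    cudaMalloc(&d_x, sizeof(h));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_x, h, sizeof(h), cudaMemcpyHostToDevice);
    warp_sum_kernel<<<1, 32>>>(d_x, d_out, 32);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", h_out);  // prints 32.000000
    cudaFree(d_x); cudaFree(d_out);
    return 0;
}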
weishengying/cutlass-kernels
weishengying/FasterTransformer
Transformer-related optimizations, including BERT and GPT
weishengying/flash-attention
Fast and memory-efficient exact attention
weishengying/How_to_optimize_in_GPU
A series of GPU optimization topics that introduces, in detail, how to optimize CUDA kernels. It covers several basic kernel optimizations, including elementwise, reduce, sgemv, and sgemm; the performance of these kernels is at or near the theoretical limit.
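To give a concrete sense of the simplest kernel in that list, below is a minimal grid-stride elementwise-add kernel. This is an illustrative sketch of my own, not code taken from How_to_optimize_in_GPU, and it carries none of the repository's tuning:

#include <cstdio>
#include <cuda_runtime.h>

// Elementwise add with a grid-stride loop so any launch size covers any n.
__global__ void add_kernel(const float* a, const float* b, float* c, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;                 // unified memory keeps the demo short
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
    add_kernel<<<256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
    printf("c[0] = %f, c[n-1] = %f\n", c[0], c[n - 1]);  // expect 3.0 for both
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}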
weishengying/lectures
Material for cuda-mode lectures
weishengying/Notes
Miscellaneous notes and jottings
weishengying/NVIDIA_SGEMM_PRACTICE
Step-by-step optimization of CUDA SGEMM
weishengying/OmniQuant
[ICLR 2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
weishengying/smoothquant
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models