MicroZHY's Stars
FlagOpen/FlagGems
FlagGems is an operator library for large language models implemented in Triton Language.
IST-DASLab/marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
wangsiping97/FastGEMV
High-speed GEMV kernels, up to 2.7x speedup over the PyTorch baseline.
NVIDIA/AMGX
Distributed multigrid linear solver library on GPU
DefTruth/Awesome-LLM-Inference
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.
flashinfer-ai/flashinfer
FlashInfer: Kernel Library for LLM Serving
kwea123/pytorch-cppcuda-tutorial
Tutorial for writing custom PyTorch C++/CUDA kernels, applied to volume rendering (NeRF).
vllm-project/llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
ModelTC/lightllm
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
Qcompiler/MIXQ
MIXQ: Taming Dynamic Outliers in Mixed-Precision Quantization by Online Prediction
HPMLL/DTC-SpMM_ASPLOS24
youngyangyang04/kamacoder-solutions
Complete solutions for Kamacoder (卡码网) problems.
xiaoyeli/superlu
Supernodal sparse direct solver. https://portal.nersc.gov/project/sparse/superlu/
usyd-fsalab/fp6_llm
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
SC24RebuttalMIXQ/Rebuttal
This repository supplements MIXQ with tests on H100, a list of corrected errors, additional tests of QUIK, and end-to-end text generation in TRT-LLM using QUIK and MIXQ.
SqueezeBits/QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
YukeWang96/TC-GNN_ATC23
Artifact for USENIX ATC'23: TC-GNN: Bridging Sparse GNN Computation and Dense Tensor Cores on GPUs.
SuperScientificSoftwareLaboratory/TileSpGEMM
Source code of the PPoPP '22 paper: "TileSpGEMM: A Tiled Algorithm for Parallel Sparse General Matrix-Matrix Multiplication on GPUs" by Yuyao Niu, Zhengyang Lu, Haonan Ji, Shuhui Song, Zhou Jin, and Weifeng Liu.
ardenma/implicit-gemm-tensor-core-convolution
Simple example of how to write an Implicit GEMM Convolution in CUDA using the tensor core WMMA API and bindings for PyTorch.
youngyangyang04/leetcode-master
"Code Thoughts" (代码随想录) LeetCode study guide: a recommended order for 200 classic problems, 600k+ words of detailed illustrated explanations, video breakdowns of difficult points, 50+ mind maps, with solutions in C++, Java, Python, Go, JavaScript, and more. No more getting lost in algorithm study! 🔥🔥 Take a look, you'll wish you'd found it sooner! 🚀
DefTruth/CUDA-Learn-Notes
🎉 Modern CUDA Learn Notes with PyTorch: fp32/tf32, fp16/bf16, fp8/int8, flash_attn, rope, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.
buaa-hipo/TCStencil
microsoft/ConvStencil
XG-zheng/Tetris-artifact-evalution
gevtushenko/matrix_format_performance
Dao-AILab/flash-attention
Fast and memory-efficient exact attention
hgyhungry/ShflBW_Sparse_NN
RussWong/CUDATutorial
A CUDA tutorial for learning CUDA programming from scratch.
dauzickaite/mpfgmres
Split-preconditioned FGMRES in four precisions.
Noaman67khan/SPAI-GMRES-IR