MicroZHY's Stars
FlagOpen/FlagGems
FlagGems is an operator library for large language models implemented in Triton Language.
IST-DASLab/marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
wangsiping97/FastGEMV
High-speed GEMV kernels, up to 2.7x speedup over the PyTorch baseline.
NVIDIA/AMGX
Distributed multigrid linear solver library on GPU
DefTruth/Awesome-LLM-Inference
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.
flashinfer-ai/flashinfer
FlashInfer: Kernel Library for LLM Serving
kwea123/pytorch-cppcuda-tutorial
Tutorial for writing custom PyTorch C++/CUDA kernels, applied to volume rendering (NeRF).
vllm-project/llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
ModelTC/lightllm
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
Qcompiler/MIXQ
MIXQ: Taming Dynamic Outliers in Mixed-Precision Quantization by Online Prediction
HPMLL/DTC-SpMM_ASPLOS24
youngyangyang04/kamacoder-solutions
Complete solutions for Kamacoder (卡码网) problems.
xiaoyeli/superlu
Supernodal sparse direct solver. https://portal.nersc.gov/project/sparse/superlu/
usyd-fsalab/fp6_llm
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
SC24RebuttalMIXQ/Rebuttal
This repository supplements MIXQ with tests on H100, a list of corrected errors, additional tests of QUIK, and end-to-end text generation in TRT-LLM using QUIK and MIXQ.
SqueezeBits/QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
YukeWang96/TC-GNN_ATC23
Artifact for USENIX ATC'23: TC-GNN: Bridging Sparse GNN Computation and Dense Tensor Cores on GPUs.
SuperScientificSoftwareLaboratory/TileSpGEMM
Source code of the PPoPP '22 paper: "TileSpGEMM: A Tiled Algorithm for Parallel Sparse General Matrix-Matrix Multiplication on GPUs" by Yuyao Niu, Zhengyang Lu, Haonan Ji, Shuhui Song, Zhou Jin, and Weifeng Liu.
ardenma/implicit-gemm-tensor-core-convolution
Simple example of how to write an Implicit GEMM Convolution in CUDA using the tensor core WMMA API and bindings for PyTorch.
youngyangyang04/leetcode-master
"Code Thoughts" (代码随想录) LeetCode study guide: a recommended order for 200 classic problems, 600k+ words of detailed illustrated explanations, video breakdowns of difficult points, 50+ mind maps, with solutions in C++, Java, Python, Go, JavaScript, and more. No more getting lost in algorithm study! 🔥🔥 Take a look, you'll wish you'd found it sooner! 🚀
DefTruth/CUDA-Learn-Notes
🎉 Modern CUDA Learn Notes with PyTorch: fp32/tf32, fp16/bf16, fp8/int8, flash_attn, rope, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.
buaa-hipo/TCStencil
microsoft/ConvStencil
XG-zheng/Tetris-artifact-evalution
gevtushenko/matrix_format_performance
Dao-AILab/flash-attention
Fast and memory-efficient exact attention
hgyhungry/ShflBW_Sparse_NN
RussWong/CUDATutorial
A CUDA tutorial for learning CUDA programming from scratch.
dauzickaite/mpfgmres
Split-preconditioned FGMRES in four precisions.
Noaman67khan/SPAI-GMRES-IR