Pinned Repositories
cute-gemm
cutlass
CUDA Templates for Linear Algebra Subroutines
flash-attention
Fast and memory-efficient exact attention
lmquant
marlin
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups at batch sizes of up to 16-32 tokens.
qserve
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
imisszxq's Repositories
imisszxq/cute-gemm
imisszxq/cutlass
imisszxq/flash-attention
imisszxq/lmquant
imisszxq/marlin
imisszxq/qserve
imisszxq/vllm