cat538's Stars
LargeWorldModel/LWM
rizsotto/Bear
Bear is a tool that generates a compilation database for clang tooling.
InternLM/lmdeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
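A minimal usage sketch of LMDeploy's high-level `pipeline` API as documented in the project README; the model name below is an arbitrary example, and response fields may differ across versions.

```python
# Minimal LMDeploy pipeline sketch (model name is an arbitrary example;
# check the project's README for current usage).
from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2-chat-7b")  # loads and serves the model
responses = pipe(["Explain KV cache quantization in one sentence."])
print(responses[0].text)  # generated completion
```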
htqin/awesome-model-quantization
A list of papers, docs, and code about model quantization. This repo aims to collect information for model quantization research and is continuously improved. PRs for works (papers, repositories) the list has missed are welcome.
HazyResearch/ThunderKittens
Tile primitives for speedy kernels
gkamradt/LLMTest_NeedleInAHaystack
Doing simple retrieval from LLMs at various context lengths to measure accuracy
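The methodology is simple enough to sketch. Below is a hypothetical harness, not this repository's API: the `llm_complete` callable, needle wording, and scoring are illustrative assumptions. The idea is to bury a "needle" fact at varying depths of a long filler context and check whether the model retrieves it.

```python
# Hypothetical needle-in-a-haystack harness; `llm_complete` stands in for
# any LLM completion call and is NOT this repository's API.
NEEDLE = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"

def build_haystack(filler: str, context_len: int, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    body = (filler * (context_len // len(filler) + 1))[:context_len]
    pos = int(len(body) * depth)
    return body[:pos] + " " + NEEDLE + " " + body[pos:]

def run_grid(llm_complete, filler, lengths=(1_000, 10_000, 100_000),
             depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    results = {}
    for n in lengths:
        for d in depths:
            prompt = build_haystack(filler, n, d) + "\n\n" + QUESTION
            answer = llm_complete(prompt)
            results[(n, d)] = "Dolores Park" in answer  # crude pass/fail scoring
    return results
```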
NVIDIA/cccl
CUDA Core Compute Libraries
Vahe1994/AQLM
Official PyTorch repository for "Extreme Compression of Large Language Models via Additive Quantization" (https://arxiv.org/pdf/2401.06118.pdf) and "PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression" (https://arxiv.org/abs/2405.14852)
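For intuition, additive quantization represents each group of weights as a sum of M codebook vectors. The toy encoder below picks codewords greedily on the residual; AQLM itself learns the codebooks and uses beam search, so this is only a sketch of the representation, not the repository's algorithm.

```python
# Illustrative sketch of additive quantization's core idea (NOT the AQLM
# codebase): approximate each weight group as a sum of M codebook vectors,
# each chosen greedily against the remaining residual.
import numpy as np

def encode_additive(x, codebooks):
    """x: (d,) group of weights; codebooks: list of M arrays of shape (K, d)."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        idx = np.argmin(((residual[None, :] - cb) ** 2).sum(axis=1))
        codes.append(idx)
        residual -= cb[idx]
    return codes

def decode_additive(codes, codebooks):
    return sum(cb[i] for i, cb in zip(codes, codebooks))

rng = np.random.default_rng(0)
books = [rng.normal(size=(256, 8)) for _ in range(2)]  # M=2 codebooks, K=256
w = rng.normal(size=8)
codes = encode_additive(w, books)
print(np.linalg.norm(w - decode_additive(codes, books)))  # reconstruction error
```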
horseee/Awesome-Efficient-LLM
A curated list for Efficient Large Language Models
HuangOwen/Awesome-LLM-Compression
Awesome LLM compression research papers and tools.
Cornell-RelaxML/quip-sharp
NVIDIA/nvbench
CUDA Kernel Benchmarking Library
NVIDIA/cuCollections
mit-han-lab/qserve
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
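The W4A8KV4 shorthand reads as 4-bit weights, 8-bit activations, and a 4-bit KV cache. Below is a toy numpy sketch of the symmetric integer quantization underlying such schemes; QServe's actual kernels are far more involved (they fuse dequantization into the GEMM and use group-wise scales).

```python
# Toy symmetric integer quantization to illustrate the W4A8KV4 notation
# (4-bit weights, 8-bit activations, 4-bit KV cache); NOT QServe's kernels.
import numpy as np

def quantize_sym(x, bits):
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit, 127 for 8-bit
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

w_q, w_s = quantize_sym(np.random.randn(128), bits=4)   # W4
a_q, a_s = quantize_sym(np.random.randn(128), bits=8)   # A8
# Integer GEMM accumulates w_q @ a_q; the result is rescaled by w_s * a_s.
y = (w_q.astype(np.int32) * a_q.astype(np.int32)).sum() * (w_s * a_s)
```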
microsoft/BitBLAS
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
mirage-project/mirage
A multi-level tensor algebra superoptimizer
hahnyuan/LLM-Viewer
Analyze the inference of Large Language Models (LLMs). Covers computation, storage, transmission, and the hardware roofline model in a user-friendly interface.
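A roofline check of the kind such a tool automates is a few lines of arithmetic. The sketch below uses rough A100-class hardware numbers (assumptions, not measurements) to show why single-token decode is memory-bound.

```python
# Back-of-envelope roofline analysis; hardware figures are rough
# A100-80GB-class assumptions, not measured values.
PEAK_FLOPS = 312e12     # FP16 tensor-core peak, FLOP/s
PEAK_BW    = 2.0e12     # HBM bandwidth, bytes/s
ridge = PEAK_FLOPS / PEAK_BW   # FLOP/byte needed to be compute-bound (~156)

# Single-token decode through one 4096x4096 FP16 weight matrix:
flops  = 2 * 4096 * 4096            # one multiply-add per weight
bytes_ = 2 * 4096 * 4096            # each FP16 weight read once (2 bytes)
intensity = flops / bytes_          # = 1 FLOP/byte << ridge -> memory-bound
t = max(flops / PEAK_FLOPS, bytes_ / PEAK_BW)
print(f"intensity={intensity:.1f} FLOP/B, ridge={ridge:.0f}, time={t*1e6:.1f} us")
```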
Bruce-Lee-LY/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
efeslab/Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
KnowingNothing/MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
ModelTC/llmc
This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
jy-yuan/KIVI
KIVI: A Tuning-Free Asymmetric 2-bit Quantization for KV Cache
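The asymmetric (zero-point) quantizer at the heart of such schemes is easy to sketch; KIVI's specific finding is that keys should be quantized per-channel and values per-token. The toy version below is illustrative, not the repository's code.

```python
# Toy asymmetric min/max quantizer of the kind KIVI applies to the KV cache
# (KIVI quantizes keys per-channel and values per-token; this only shows
# the quantizer itself, not the repo's fused kernels).
import numpy as np

def quantize_asym(x, bits=2, axis=0):
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / (2 ** bits - 1)           # 3 levels of step for 2-bit
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q * scale + lo

k = np.random.randn(16, 64)                       # (tokens, channels)
q, s, z = quantize_asym(k, bits=2, axis=0)        # per-channel, as for keys
err = np.abs(k - dequantize(q, s, z)).mean()
print(f"mean abs error: {err:.3f}")
```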
usyd-fsalab/fp6_llm
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
mit-han-lab/Quest
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
opengear-project/GEAR
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
SqueezeBits/QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
jeffreysijuntan/lloco
The official repo for "LLoCo: Learning Long Contexts Offline"
mit-han-lab/lmquant
naver-aics/lut-gemm
EfficientLLMSys/MuxServe