cat538's Stars
LargeWorldModel/LWM
rizsotto/Bear
Bear is a tool that generates a compilation database for clang tooling.
InternLM/lmdeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
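A minimal usage sketch of LMDeploy's high-level `pipeline` API as documented in the project README; the model name below is an arbitrary example, and response fields may differ across versions.

```python
# Minimal LMDeploy pipeline sketch (model name is an arbitrary example;
# check the project's README for current usage).
from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2-chat-7b")  # loads and serves the model
responses = pipe(["Explain KV cache quantization in one sentence."])
print(responses[0].text)  # generated completion
```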
htqin/awesome-model-quantization
A list of papers, docs, and code about model quantization. This repo aims to collect information for model quantization research and is continuously improved. PRs for works (papers, repositories) the list has missed are welcome.
HazyResearch/ThunderKittens
Tile primitives for speedy kernels
gkamradt/LLMTest_NeedleInAHaystack
Doing simple retrieval from LLMs at various context lengths to measure accuracy
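The methodology is simple enough to sketch. Below is a hypothetical harness, not this repository's API: the `llm_complete` callable, needle wording, and scoring are illustrative assumptions. The idea is to bury a "needle" fact at varying depths of a long filler context and check whether the model retrieves it.

```python
# Hypothetical needle-in-a-haystack harness; `llm_complete` stands in for
# any LLM completion call and is NOT this repository's API.
NEEDLE = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"

def build_haystack(filler: str, context_len: int, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    body = (filler * (context_len // len(filler) + 1))[:context_len]
    pos = int(len(body) * depth)
    return body[:pos] + " " + NEEDLE + " " + body[pos:]

def run_grid(llm_complete, filler, lengths=(1_000, 10_000, 100_000),
             depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    results = {}
    for n in lengths:
        for d in depths:
            prompt = build_haystack(filler, n, d) + "\n\n" + QUESTION
            answer = llm_complete(prompt)
            results[(n, d)] = "Dolores Park" in answer  # crude pass/fail scoring
    return results
```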
NVIDIA/cccl
CUDA Core Compute Libraries
Vahe1994/AQLM
Official PyTorch repository for "Extreme Compression of Large Language Models via Additive Quantization" (https://arxiv.org/pdf/2401.06118.pdf) and "PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression" (https://arxiv.org/abs/2405.14852)
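For intuition, additive quantization represents each group of weights as a sum of M codebook vectors. The toy encoder below picks codewords greedily on the residual; AQLM itself learns the codebooks and uses beam search, so this is only a sketch of the representation, not the repository's algorithm.

```python
# Illustrative sketch of additive quantization's core idea (NOT the AQLM
# codebase): approximate each weight group as a sum of M codebook vectors,
# each chosen greedily against the remaining residual.
import numpy as np

def encode_additive(x, codebooks):
    """x: (d,) group of weights; codebooks: list of M arrays of shape (K, d)."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        idx = np.argmin(((residual[None, :] - cb) ** 2).sum(axis=1))
        codes.append(idx)
        residual -= cb[idx]
    return codes

def decode_additive(codes, codebooks):
    return sum(cb[i] for i, cb in zip(codes, codebooks))

rng = np.random.default_rng(0)
books = [rng.normal(size=(256, 8)) for _ in range(2)]  # M=2 codebooks, K=256
w = rng.normal(size=8)
codes = encode_additive(w, books)
print(np.linalg.norm(w - decode_additive(codes, books)))  # reconstruction error
```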
horseee/Awesome-Efficient-LLM
A curated list for Efficient Large Language Models
HuangOwen/Awesome-LLM-Compression
Awesome LLM compression research papers and tools.
Cornell-RelaxML/quip-sharp
NVIDIA/nvbench
CUDA Kernel Benchmarking Library
NVIDIA/cuCollections
mit-han-lab/qserve
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
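The W4A8KV4 shorthand reads as 4-bit weights, 8-bit activations, and a 4-bit KV cache. Below is a toy numpy sketch of the symmetric integer quantization underlying such schemes; QServe's actual kernels are far more involved (they fuse dequantization into the GEMM and use group-wise scales).

```python
# Toy symmetric integer quantization to illustrate the W4A8KV4 notation
# (4-bit weights, 8-bit activations, 4-bit KV cache); NOT QServe's kernels.
import numpy as np

def quantize_sym(x, bits):
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit, 127 for 8-bit
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

w_q, w_s = quantize_sym(np.random.randn(128), bits=4)   # W4
a_q, a_s = quantize_sym(np.random.randn(128), bits=8)   # A8
# Integer GEMM accumulates w_q @ a_q; the result is rescaled by w_s * a_s.
y = (w_q.astype(np.int32) * a_q.astype(np.int32)).sum() * (w_s * a_s)
```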
microsoft/BitBLAS
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
mirage-project/mirage
A multi-level tensor algebra superoptimizer
hahnyuan/LLM-Viewer
Analyze the inference of Large Language Models (LLMs). Covers computation, storage, transmission, and the hardware roofline model in a user-friendly interface.
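A roofline check of the kind such a tool automates is a few lines of arithmetic. The sketch below uses rough A100-class hardware numbers (assumptions, not measurements) to show why single-token decode is memory-bound.

```python
# Back-of-envelope roofline analysis; hardware figures are rough
# A100-80GB-class assumptions, not measured values.
PEAK_FLOPS = 312e12     # FP16 tensor-core peak, FLOP/s
PEAK_BW    = 2.0e12     # HBM bandwidth, bytes/s
ridge = PEAK_FLOPS / PEAK_BW   # FLOP/byte needed to be compute-bound (~156)

# Single-token decode through one 4096x4096 FP16 weight matrix:
flops  = 2 * 4096 * 4096            # one multiply-add per weight
bytes_ = 2 * 4096 * 4096            # each FP16 weight read once (2 bytes)
intensity = flops / bytes_          # = 1 FLOP/byte << ridge -> memory-bound
t = max(flops / PEAK_FLOPS, bytes_ / PEAK_BW)
print(f"intensity={intensity:.1f} FLOP/B, ridge={ridge:.0f}, time={t*1e6:.1f} us")
```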
Bruce-Lee-LY/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
efeslab/Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
KnowingNothing/MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
ModelTC/llmc
This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
jy-yuan/KIVI
KIVI: A Tuning-Free Asymmetric 2-bit Quantization for KV Cache
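The asymmetric (zero-point) quantizer at the heart of such schemes is easy to sketch; KIVI's specific finding is that keys should be quantized per-channel and values per-token. The toy version below is illustrative, not the repository's code.

```python
# Toy asymmetric min/max quantizer of the kind KIVI applies to the KV cache
# (KIVI quantizes keys per-channel and values per-token; this only shows
# the quantizer itself, not the repo's fused kernels).
import numpy as np

def quantize_asym(x, bits=2, axis=0):
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / (2 ** bits - 1)           # 3 levels of step for 2-bit
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q * scale + lo

k = np.random.randn(16, 64)                       # (tokens, channels)
q, s, z = quantize_asym(k, bits=2, axis=0)        # per-channel, as for keys
err = np.abs(k - dequantize(q, s, z)).mean()
print(f"mean abs error: {err:.3f}")
```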
usyd-fsalab/fp6_llm
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
mit-han-lab/Quest
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
opengear-project/GEAR
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
SqueezeBits/QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
jeffreysijuntan/lloco
The official repo for "LLoCo: Learning Long Contexts Offline"
mit-han-lab/lmquant
naver-aics/lut-gemm
EfficientLLMSys/MuxServe