ItsAbdula's Stars
IST-DASLab/marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
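The near-ideal ~4x figure follows from memory traffic: at small batch sizes the matmul is bound by streaming the weight matrix, and INT4 weights are a quarter the size of FP16. A back-of-envelope sketch (the layer shape is an arbitrary example, not taken from the repo):

```python
# Why ~4x is the ideal small-batch speedup: a weight-bound GEMV/GEMM streams
# the whole weight matrix once per call, so the ceiling is the byte ratio.
K, N = 4096, 4096                 # hypothetical Linear layer shape
fp16_bytes = K * N * 2            # 16-bit weights: 2 bytes each
int4_bytes = K * N // 2           # 4-bit weights: 2 packed per byte
print(fp16_bytes / int4_bytes)    # -> 4.0, the ideal memory-traffic speedup
```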
xvyaward/owq
Code for the AAAI 2024 Oral paper "OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models".
srush/GPU-Puzzles
Solve puzzles. Learn CUDA.
qwopqwop200/GPTQ-for-LLaMa
4-bit quantization of LLaMA using GPTQ
AutoGPTQ/AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
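A minimal sketch of the README-style quantize-and-save flow (the model name, calibration sentence, and output directory are placeholders):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained = "facebook/opt-125m"                      # assumption: any HF causal LM
tokenizer = AutoTokenizer.from_pretrained(pretrained)

# A real run would use a few hundred calibration samples, not one sentence.
examples = [tokenizer("AutoGPTQ quantization calibration sample.", return_tensors="pt")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(pretrained, quantize_config)
model.quantize(examples)                              # run GPTQ layer by layer
model.save_quantized("opt-125m-4bit")                 # hypothetical output dir
```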
IST-DASLab/gptq
Code for the ICLR 2023 paper "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers".
mit-han-lab/llm-awq
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
efeslab/Nanoflow
A throughput-oriented high-performance serving framework for LLMs
pytorch/ao
PyTorch native quantization and sparsity for training and inference
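A weight-only quantization sketch in the style of the torchao README; symbol names have shifted across torchao releases, so treat the exact imports as assumptions:

```python
import torch
from torchao.quantization import quantize_, int4_weight_only

# int4 weight-only quantization expects a bfloat16 model on CUDA.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()
quantize_(model, int4_weight_only())  # swaps Linear weights for packed int4 in place

x = torch.randn(16, 1024, dtype=torch.bfloat16, device="cuda")
y = model(x)
```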
huggingface/optimum-quanto
A PyTorch quantization backend for Optimum.
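A minimal sketch of quanto's documented quantize/freeze flow on a toy model:

```python
import torch
from optimum.quanto import quantize, freeze, qint8

model = torch.nn.Sequential(
    torch.nn.Linear(256, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
)

quantize(model, weights=qint8)  # replace weights with int8 qtensors
freeze(model)                   # drop the float originals, keep only int8

out = model(torch.randn(4, 256))
```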
mit-han-lab/smoothquant
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
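The core of the paper is a per-channel scale migration that leaves the matmul unchanged: Y = (X diag(s)^-1)(diag(s) W), with s chosen per input channel to balance activation and weight outliers. A minimal sketch of that identity with stand-in tensors (alpha = 0.5 is the paper's default migration strength):

```python
import torch

T, C_in, C_out, alpha = 32, 64, 128, 0.5
X = torch.randn(T, C_in) * (torch.rand(C_in) * 10)  # activations with outlier channels
W = torch.randn(C_in, C_out)

# s_j = max|X_j|^alpha / max|W_j|^(1-alpha), one scale per input channel
s = X.abs().amax(dim=0).pow(alpha) / W.abs().amax(dim=1).pow(1 - alpha)

X_hat = X / s             # smoothed activations: X diag(s)^-1
W_hat = W * s[:, None]    # compensated weights:  diag(s) W

# The product is mathematically unchanged; only the quantization difficulty moved.
assert torch.allclose(X @ W, X_hat @ W_hat, atol=1e-3)
```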
LLMServe/DistServe
Disaggregated serving system for Large Language Models (LLMs).
facebookresearch/sapiens
High-resolution models for human tasks.
casper-hansen/AutoAWQ
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
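A minimal sketch in the style of AutoAWQ's documented flow (model path, output directory, and config values are placeholders):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"  # assumption: any supported HF model
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)  # calibrate and pack to 4 bits
model.save_quantized("mistral-7b-awq")                # hypothetical output dir
tokenizer.save_pretrained("mistral-7b-awq")
```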
htqin/awesome-model-quantization
A curated list of papers, docs, and code about model quantization. This repo aims to collect resources for model quantization research and is continuously improved; PRs for works (papers, repositories) the list has missed are welcome.
casys-kaist/NeuPIMs
NeuPIMs Simulator
microsoft/ParrotServe
[OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable
AlibabaPAI/FLASHNN
triton-lang/triton
Development repository for the Triton language and compiler
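For a feel of the language, here is the canonical vector-add kernel in the style of Triton's own tutorial (sizes and block shape are arbitrary):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements               # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, x + y)
```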
cuda-mode/lectures
Material for cuda-mode lectures
Kobzol/hardware-effects-gpu
Demonstration of various hardware effects on CUDA GPUs.
srush/Triton-Puzzles
Puzzles for learning Triton
microsoft/vidur
A large-scale simulation framework for LLM inference
Mozilla-Ocho/llamafile
Distribute and run LLMs with a single file.
cuda-mode/resource-stream
CUDA-related news and material links
microsoft/vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
AlibabaPAI/llumnix
Efficient and easy multi-instance LLM serving
HanGuo97/flute
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
CisMine/Guide-NVIDIA-Tools
NVIDIA tools guide
AnswerDotAI/gpu.cpp
A lightweight library for portable low-level GPU computation using WebGPU.