jianyuheng's Stars
mlc-ai/mlc-llm
Universal LLM Deployment Engine with ML Compilation
Lightning-AI/litgpt
20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
NVIDIA/TensorRT-LLM
TensorRT-LLM provides an easy-to-use Python API for defining Large Language Models (LLMs) and building TensorRT engines that contain state-of-the-art optimizations for efficient inference on NVIDIA GPUs. It also includes components to create Python and C++ runtimes that execute those TensorRT engines.
axolotl-ai-cloud/axolotl
Go ahead and axolotl questions
microsoft/LMOps
General technology for enabling AI capabilities with LLMs and MLLMs
turboderp/exllamav2
A fast inference library for running LLMs locally on modern consumer-class GPUs
FasterDecoding/Medusa
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
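As a hedged illustration of the core idea (extra lightweight heads drafting several future tokens from one hidden state, later verified by the base model), a minimal PyTorch sketch; the class name, layer shapes, and head count are assumptions, not the repo's actual architecture:

```python
import torch.nn as nn

# Minimal sketch of Medusa-style decoding heads: head k predicts the
# token k steps ahead from the same last hidden state, so several
# candidate tokens are drafted in a single forward pass.
class MedusaHeads(nn.Module):
    def __init__(self, hidden_size, vocab_size, num_heads=4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, hidden_size),
                nn.SiLU(),
                nn.Linear(hidden_size, vocab_size),
            )
            for _ in range(num_heads)
        )

    def forward(self, hidden):  # hidden: (batch, hidden_size)
        # One logit tensor per lookahead position; the base model then
        # verifies the drafted candidates and keeps the accepted prefix.
        return [head(hidden) for head in self.heads]
```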
horseee/Awesome-Efficient-LLM
A curated list for Efficient Large Language Models
Vahe1994/AQLM
Official PyTorch repository for "Extreme Compression of Large Language Models via Additive Quantization" (https://arxiv.org/pdf/2401.06118.pdf) and "PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression" (https://arxiv.org/abs/2405.14852)
OpenGVLab/OmniQuant
[ICLR 2024 Spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
mobiusml/hqq
Official implementation of Half-Quadratic Quantization (HQQ)
SqueezeAILab/SqueezeLLM
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
princeton-nlp/LLM-Shearing
[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
segmind/distill-sd
Segmind Distilled diffusion
mit-han-lab/qserve
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
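The W4A8KV4 label denotes 4-bit weights, 8-bit activations, and a 4-bit KV cache. As a hedged sketch of the weight side only, assuming plain per-channel symmetric round-to-nearest; QServe's actual progressive quantization algorithm and fused GPU kernels are more involved:

```python
import numpy as np

# Illustrative int4 weight quantization (symmetric, round-to-nearest,
# per output channel). Shows only what the "W4" in W4A8KV4 means
# numerically; this is not QServe's method.
def quantize_w4(weight):             # weight: (out_features, in_features)
    scale = np.abs(weight).max(axis=1, keepdims=True) / 7.0  # int4 range [-8, 7]
    scale[scale == 0] = 1.0          # guard all-zero rows
    q = np.clip(np.round(weight / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_w4(q, scale):
    return q.astype(np.float32) * scale
```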
hao-ai-lab/Consistency_LLM
[ICML 2024] CLLMs: Consistency Large Language Models
spcl/QuaRot
Code for the NeurIPS 2024 paper: QuaRot, end-to-end 4-bit inference for large language models.
Cornell-RelaxML/QuIP
Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"
Nota-NetsPresso/BK-SDM
A Compressed Stable Diffusion for Efficient Text-to-Image Generation [ECCV'24]
OpenGVLab/EfficientQAT
EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
hemingkx/Spec-Bench
Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings)
MFaceTech/InstantID
taprosoft/llm_finetuning
Convenient wrapper for fine-tuning and inference of Large Language Models (LLMs) with several quantization techniques (GPTQ, bitsandbytes)
jaymody/speculative-sampling
Simple implementation of Speculative Sampling in NumPy for GPT-2.
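In the same NumPy spirit, a hedged sketch of the accept/resample rule at the heart of speculative sampling; the function name and the draft/target probability inputs are assumptions, not this repo's actual API:

```python
import numpy as np

# Accept/resample rule for one drafted token (Leviathan et al., 2023).
# `draft_probs` / `target_probs` are full-vocab distributions from
# hypothetical draft and target models.
def speculative_step(token, draft_probs, target_probs, rng):
    # Accept the drafted token with probability min(1, p(token)/q(token)).
    if rng.uniform() < min(1.0, target_probs[token] / draft_probs[token]):
        return token, True
    # On rejection, resample from the normalized residual max(p - q, 0),
    # which keeps the overall output distribution exactly p.
    residual = np.maximum(target_probs - draft_probs, 0.0)
    residual /= residual.sum()
    return rng.choice(residual.size, p=residual), False
```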
ruiqixu37/distill_diffusion
Implementation of the CVPR 2023 Award Candidate "On Distillation of Guided Diffusion Models"
LiqunMa/FBI-LLM
FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation
Qualcomm-AI-research/lr-qat
Skhaki18/optin-transformer-pruning
[ICLR 2024] The Need for Speed: Pruning Transformers with One Recipe
MFaceTech/AIGC-SD-Acceleration
onliwad101/FlexRound_LRQ
FlexRound (ICML 2023) & LRQ