Awesome LLM compression research papers and tools.

MIT License

Awesome LLM Compression

Awesome LLM compression research papers and tools to accelerate LLM training and inference.

Papers

Survey

  • A Survey on Model Compression for Large Language Models
    Arxiv 2023 [Paper]

  • The Efficiency Spectrum of Large Language Models: An Algorithmic Survey
    Arxiv 2023 [Paper]

  • Efficient Large Language Models: A Survey
    Arxiv 2023 [Paper] [GitHub Page]

  • Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
    Arxiv 2023 [Paper]

  • Understanding LLMs: A Comprehensive Overview from Training to Inference
    Arxiv 2024 [Paper]

  • A Survey of Resource-efficient LLM and Multimodal Foundation Models
    Arxiv 2024 [Paper]

  • A Survey on Hardware Accelerators for Large Language Models
    Arxiv 2024 [Paper]

  • A Comprehensive Survey of Compression Algorithms for Language Models
    Arxiv 2024 [Paper]

Quantization

  • ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers
    NeurIPS 2022 [Paper] [Code (DeepSpeed)]

  • LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
    NeurIPS 2022 [Paper] [Code]

  • Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models
    NeurIPS 2022 [Paper] [Code]

  • LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models
    Arxiv 2022 [Paper]

  • SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
    ICML 2023 [Paper] [Code]

  • FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization
    ICML 2023 [Paper] [Code (DeepSpeed)]

  • Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases
    ICML 2023 [Paper] [Code]

  • The case for 4-bit precision: k-bit Inference Scaling Laws
    ICML 2023 [Paper]

  • GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
    ICLR 2023 [Paper] [Code]

  • PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models
    ACL 2023 [Paper]

  • Boost Transformer-based Language Models with GPU-Friendly Sparsity and Quantization
    ACL 2023 [Paper]

  • QLoRA: Efficient Finetuning of Quantized LLMs
    NeurIPS 2023 [Paper] [Code]

  • The Quantization Model of Neural Scaling
    NeurIPS 2023 [Paper]

  • Quantized Distributed Training of Large Models with Convergence Guarantees
    Arxiv 2023 [Paper]

  • RPTQ: Reorder-based Post-training Quantization for Large Language Models
    Arxiv 2023 [Paper] [Code]

  • ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation
    Arxiv 2023 [Paper] [Code]

  • Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models
    Arxiv 2023 [Paper]

  • Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization
    NeurIPS 2023 [Paper]

  • Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt
    Arxiv 2023 [Paper]

  • AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
    Arxiv 2023 [Paper] [Code]

  • LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
    Arxiv 2023 [Paper] [Code]

  • SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
    Arxiv 2023 [Paper] [Code]

  • OWQ: Lessons learned from activation outliers for weight quantization in large language models
    Arxiv 2023 [Paper]

  • SqueezeLLM: Dense-and-Sparse Quantization
    Arxiv 2023 [Paper] [Code]

  • INT2.1: Towards Fine-Tunable Quantized Large Language Models with Error Correction through Low-Rank Adaptation
    Arxiv 2023 [Paper]

  • INT-FP-QSim: Mixed Precision and Formats For Large Language Models and Vision Transformers
    Arxiv 2023 [Paper] [Code]

  • QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models
    Arxiv 2023 [Paper] [Code]

  • Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study
    Arxiv 2023 [Paper]

  • ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats
    Arxiv 2023 [Paper] [Code (DeepSpeed)]

  • OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization
    ISCA 2023 [Paper]

  • NUPES: Non-Uniform Post-Training Quantization via Power Exponent Search
    Arxiv 2023 [Paper]

  • GPT-Zip: Deep Compression of Finetuned Large Language Models
    ICML 2023 Workshop ES-FoMO [Paper]

  • Generating Efficient Kernels for Quantized Inference on Large Language Models
    ICML 2023 Workshop ES-FoMO [Paper]

  • Gradient-Based Post-Training Quantization: Challenging the Status Quo
    Arxiv 2023 [Paper]

  • FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs
    Arxiv 2023 [Paper]

  • OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
    ICLR 2024 [Paper] [Code]

  • FPTQ: Fine-grained Post-Training Quantization for Large Language Models
    Arxiv 2023 [Paper]

  • eDKM: An Efficient and Accurate Train-time Weight Clustering for Large Language Models
    Arxiv 2023 [Paper]

  • QuantEase: Optimization-based Quantization for Language Models -- An Efficient and Intuitive Algorithm
    Arxiv 2023 [Paper]

  • Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
    Arxiv 2023 [Paper]

  • Understanding the Impact of Post-Training Quantization on Large-scale Language Models
    Arxiv 2023 [Paper]

  • MEMORY-VQ: Compression for Tractable Internet-Scale Memory
    Arxiv 2023 [Paper]

  • Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs
    Arxiv 2023 [Paper] [Code]

  • Efficient Post-training Quantization with FP8 Formats
    Arxiv 2023 [Paper] [Code (Intel® Neural Compressor)]

  • QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
    Arxiv 2023 [Paper] [Code]

  • Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models
    Arxiv 2023 [Paper]

  • ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers
    Arxiv 2023 [Paper]

  • PB-LLM: Partially Binarized Large Language Models
    Arxiv 2023 [Paper] [Code]

  • Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM
    Arxiv 2023 [Paper]

  • QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models
    Arxiv 2023 [Paper]

  • LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
    Arxiv 2023 [Paper]

  • QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources
    Arxiv 2023 [Paper]

  • TEQ: Trainable Equivalent Transformation for Quantization of LLMs
    Arxiv 2023 [Paper] [Code (Intel® Neural Compressor)]

  • BitNet: Scaling 1-bit Transformers for Large Language Models
    Arxiv 2023 [Paper]

  • FP8-LM: Training FP8 Large Language Models
    Arxiv 2023 [Paper] [Code]

  • QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models
    Arxiv 2023 [Paper] [Code]

  • AFPQ: Asymmetric Floating Point Quantization for LLMs
    Arxiv 2023 [Paper] [Code]

  • AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models
    Arxiv 2023 [Paper]

  • Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
    Arxiv 2023 [Paper]

  • QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
    Arxiv 2023 [Paper]

  • Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models
    Arxiv 2023 [Paper]

  • How Does Calibration Data Affect the Post-training Pruning and Quantization of Large Language Models?
    Arxiv 2023 [Paper]

  • A Speed Odyssey for Deployable Quantization of LLMs
    Arxiv 2023 [Paper]

  • Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization
    Arxiv 2023 [Paper]

  • Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
    NeurIPS 2023 [Paper] [Code]

  • Efficient LLM Inference on CPUs
    NeurIPS 2023 Workshop on Efficient Natural Language and Speech Processing [Paper] [Code]

  • The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models
    EMNLP Findings 2023 [Paper]

  • Zero-Shot Sharpness-Aware Quantization for Pre-trained Language Models
    EMNLP 2023 [Paper]

  • Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?
    EMNLP 2023 [Paper] [Code]

  • Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling
    EMNLP 2023 [Paper]

  • Watermarking LLMs with Weight Quantization
    EMNLP 2023 [Paper] [Code]

  • Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization
    EMNLP 2023 [Paper]

  • LLM-FP4: 4-Bit Floating-Point Quantized Transformers
    EMNLP 2023 [Paper] [Code]

  • Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge
    AAAI 2024 [Paper]

  • SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM
    Arxiv 2023 [Paper]

  • CBQ: Cross-Block Quantization for Large Language Models
    Arxiv 2023 [Paper]

  • ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks
    Arxiv 2023 [Paper]

  • QuIP: 2-Bit Quantization of Large Language Models With Guarantees
    NeurIPS 2023 [Paper] [Code]

  • A Performance Evaluation of a Quantized Large Language Model on Various Smartphones
    Arxiv 2023 [Paper]

  • FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGA
    Arxiv 2024 [Paper]

  • Extreme Compression of Large Language Models via Additive Quantization
    Arxiv 2024 [Paper]

  • Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models
    Arxiv 2024 [Paper]

  • Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models
    Arxiv 2024 [Paper]

  • FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
    Arxiv 2024 [Paper]

  • KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
    Arxiv 2024 [Paper]

  • Can Large Language Models Understand Context?
    Arxiv 2024 [Paper]

  • AffineQuant: Affine Transformation Quantization for Large Language Models
    EACL 2024 [Paper]

Pruning and Sparsity

  • The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers
    ICLR 2023 [Paper]

  • Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
    ICML 2023 [Paper] [Code]

  • LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation
    ICML 2023 [Paper] [Code]

  • LLM-Pruner: On the Structural Pruning of Large Language Models
    NeurIPS 2023 [Paper] [Code]

  • ZipLM: Inference-Aware Structured Pruning of Language Models
    NeurIPS 2023 [Paper] [Code]

  • H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
    NeurIPS 2023 [Paper] [Code]

  • Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
    NeurIPS 2023 [Paper]

  • The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter
    NeurIPS 2023 [Paper] [Code]

  • Learning to Compress Prompts with Gist Tokens
    NeurIPS 2023 [Paper]

  • Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
    NeurIPS 2023 [Paper]

  • Prune and Tune: Improving Efficient Pruning Techniques for Massive Language Models
    ICLR 2023 TinyPapers [Paper]

  • SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
    Arxiv 2023 [Paper] [Code]

  • Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering
    Arxiv 2023 [Paper] [Code]

  • Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale
    ACL 2023 [Paper] [Code]

  • Structured Pruning for Efficient Generative Pre-trained Language Models
    ACL 2023 [Paper]

  • A Simple and Effective Pruning Approach for Large Language Models
    Arxiv 2023 [Paper] [Code]

  • Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning
    Arxiv 2023 [Paper]

  • Structural pruning of large language models via neural architecture search
    AutoML 2023 [Paper]

  • Pruning Large Language Models via Accuracy Predictor
    ICASSP 2024 [Paper]

  • Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
    VLDB 2024 [Paper] [Code]

  • Compressing LLMs: The Truth is Rarely Pure and Never Simple
    Arxiv 2023 [Paper]

  • Junk DNA Hypothesis: A Task-Centric Angle of LLM Pre-trained Weights through Sparsity
    Arxiv 2023 [Paper] [Code]

  • Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
    Arxiv 2023 [Paper]

  • Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models
    Arxiv 2023 [Paper] [Code]

  • Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity
    Arxiv 2023 [Paper] [Code]

  • Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
    Arxiv 2023 [Paper] [Code]

  • Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs
    Arxiv 2023 [Paper] [Code]

  • One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models
    ICASSP 2024 [Paper]

  • Survival of the Most Influential Prompts: Efficient Black-Box Prompt Search via Clustering and Pruning
    EMNLP 2023 Findings [Paper]

  • The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models
    EMNLP Findings 2023 [Paper]

  • Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization
    Arxiv 2023 [Paper]

  • LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery
    Arxiv 2023 [Paper]

  • ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models
    Arxiv 2023 [Paper]

  • E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity
    Arxiv 2023 [Paper]

  • Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models
    Arxiv 2023 [Paper] [Code]

  • How Does Calibration Data Affect the Post-training Pruning and Quantization of Large Language Models?
    Arxiv 2023 [Paper]

  • BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation
    OpenReview [Paper] [Code]

  • Pushing Gradient Towards Zero: A Novel Pruning Method for Large Language Models
    OpenReview 2023 [Paper]

  • An Efficient Plug-and-Play Post-Training Pruning Strategy in Large Language Models
    Preprints 2023 [Paper]

  • Lighter, yet More Faithful: Investigating Hallucinations in Pruned Large Language Models for Abstractive Summarization
    Arxiv 2023 [Paper] [Code]

  • LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning
    Arxiv 2023 [Paper]

  • Mini-GPTs: Efficient Large Language Models through Contextual Pruning
    Arxiv 2023 [Paper] [Code]

  • The LLM Surgeon
    Arxiv 2023 [Paper]

  • Fluctuation-based Adaptive Structured Pruning for Large Language Models
    AAAI 2024 [Paper]

  • How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark
    CPAL 2024 [Paper]

  • PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs
    Arxiv 2023 [Paper]

  • Fast and Optimal Weight Update for Pruned Large Language Models
    Arxiv 2024 [Paper]

  • APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference
    Arxiv 2024 [Paper]

  • Scaling Sparse Fine-Tuning to Large Language Models
    Arxiv 2024 [Paper]

  • SliceGPT: Compress Large Language Models by Deleting Rows and Columns
    ICLR 2024 [Paper] [Code]

Distillation

  • Lifting the Curse of Capacity Gap in Distilling Language Models
    ACL 2023 [Paper] [Code]

  • Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step
    ACL 2023 [Paper]

  • Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
    ACL 2023 [Paper]

  • SCOTT: Self-Consistent Chain-of-Thought Distillation
    ACL 2023 [Paper]

  • DISCO: Distilling Counterfactuals with Large Language Models
    ACL 2023 [Paper] [Code]

  • LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions
    Arxiv 2023 [Paper] [Code]

  • How To Train Your (Compressed) Large Language Model
    Arxiv 2023 [Paper]

  • The False Promise of Imitating Proprietary LLMs
    Arxiv 2023 [Paper]

  • GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo
    Arxiv 2023 [Paper] [Code]

  • PaD: Program-aided Distillation Specializes Large Models in Reasoning
    Arxiv 2023 [Paper]

  • Knowledge Distillation of Large Language Models
    Arxiv 2023 [Paper] [Code]

  • GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models
    Arxiv 2023 [Paper]

  • Chain-of-Thought Prompt Distillation for Multimodal Named Entity and Multimodal Relation Extraction
    Arxiv 2023 [Paper]

  • Task-agnostic Distillation of Encoder-Decoder Language Models
    Arxiv 2023 [Paper]

  • Sci-CoT: Leveraging Large Language Models for Enhanced Knowledge Distillation in Small Models for Scientific QA
    Arxiv 2023 [Paper]

  • Can a student Large Language Model perform as well as it's teacher?
    Arxiv 2023 [Paper]

  • Multistage Collaborative Knowledge Distillation from Large Language Models
    Arxiv 2023 [Paper]

  • Lion: Adversarial Distillation of Closed-Source Large Language Model
    EMNLP 2023 [Paper] [Code]

  • MCC-KD: Multi-CoT Consistent Knowledge Distillation
    EMNLP 2023 [Paper]

  • PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation
    EMNLP 2023 [Paper]

  • YODA: Teacher-Student Progressive Learning for Language Models
    Arxiv 2023 [Paper]

Efficient Prompting

  • Did You Read the Instructions? Rethinking the Effectiveness of Task Definitions in Instruction Learning
    ACL 2023 [Paper] [Code]

  • Batch Prompting: Efficient Inference with Large Language Model APIs
    EMNLP 2023 [Paper] [Code]

  • Adapting Language Models to Compress Contexts
    EMNLP 2023 [Paper] [Code]

  • Compressing Context to Enhance Inference Efficiency of Large Language Models
    EMNLP 2023 [Paper] [Code]

  • LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models
    EMNLP 2023 [Paper] [Code]

  • Vector-Quantized Prompt Learning for Paraphrase Generation
    EMNLP 2023 Findings [Paper]

  • Efficient Prompting via Dynamic In-Context Learning
    Arxiv 2023 [Paper]

  • Learning to Compress Prompts with Gist Tokens
    Arxiv 2023 [Paper] [Code]

  • In-context Autoencoder for Context Compression in a Large Language Model
    Arxiv 2023 [Paper]

  • Discrete Prompt Compression with Reinforcement Learning
    Arxiv 2023 [Paper]

  • BatchPrompt: Accomplish more with less
    Arxiv 2023 [Paper]

  • (Dynamic) Prompting might be all you need to repair Compressed LLMs
    Arxiv 2023 [Paper]

  • RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation
    Arxiv 2023 [Paper] [Code]

  • LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression
    Arxiv 2023 [Paper] [Code]

  • Extending Context Window of Large Language Models via Semantic Compression
    Arxiv 2023 [Paper]

  • Boosting LLM Reasoning: Push the Limits of Few-shot Learning with Reinforced In-Context Pruning
    Arxiv 2023 [Paper]

  • The Impact of Reasoning Step Length on Large Language Models
    Arxiv 2024 [Paper]

Other

  • TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition
    Arxiv 2023 [Paper]

  • Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
    Arxiv 2023 [Paper]

  • SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference
    Arxiv 2023 [Paper]

  • Scaling In-Context Demonstrations with Structured Attention
    Arxiv 2023 [Paper]

  • Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline
    Arxiv 2023 [Paper] [Code]

  • CPET: Effective Parameter-Efficient Tuning for Compressed Large Language Models
    Arxiv 2023 [Paper]

  • Ternary Singular Value Decomposition as a Better Parameterized Form in Linear Mapping
    Arxiv 2023 [Paper]

  • LLMCad: Fast and Scalable On-device Large Language Model Inference
    Arxiv 2023 [Paper]

  • LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
    Arxiv 2023 [Paper] [Code]

  • LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression
    Arxiv 2023 [Paper] [Code]

  • Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation
    Arxiv 2023 [Paper]

  • Efficient Streaming Language Models with Attention Sinks
    Arxiv 2023 [Paper] [Code]

  • Efficient Large Language Models Fine-Tuning On Graphs
    Arxiv 2023 [Paper]

  • SparQ Attention: Bandwidth-Efficient LLM Inference
    Arxiv 2023 [Paper]

  • Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models
    Arxiv 2023 [Paper]

  • PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
    Arxiv 2023 [Paper] [Code]

  • Dataset Quantization
    ICCV 2023 [Paper] [Code]

  • Text Alignment Is An Efficient Unified Model for Massive NLP Tasks
    NeurIPS 2023 [Paper] [Code]

  • Context Compression for Auto-regressive Transformers with Sentinel Tokens
    EMNLP 2023 [Paper] [Code]

  • TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction
    EMNLP 2023 Findings [Paper]

  • Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression
    EMNLP 2023 Findings [Paper]

  • FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference
    Arxiv 2024 [Paper]

  • LoMA: Lossless Compressed Memory Attention
    Arxiv 2024 [Paper]

  • Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
    Arxiv 2024 [Paper] [Code]

  • BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
    Arxiv 2024 [Paper] [Code]

  • CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks
    Arxiv 2024 [Paper]

Tools

  • BMCook: Model Compression for Big Models [Code]

  • llama.cpp: Inference of LLaMA model in pure C/C++ [Code]

  • LangChain: Building applications with LLMs through composability [Code]

  • GPTQ-for-LLaMA: 4 bits quantization of LLaMA using GPTQ [Code]

  • Alpaca-CoT: An Instruction Fine-Tuning Platform with Instruction Data Collection and Unified Large Language Models Interface [Code]

  • vllm: A high-throughput and memory-efficient inference and serving engine for LLMs [Code]

  • LLaMA Efficient Tuning: Fine-tuning LLaMA with PEFT (PT+SFT+RLHF with QLoRA) [Code]

  • gpt-fast: Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python. [Code]

  • Efficient-Tuning-LLMs: (Efficient Finetuning of QLoRA LLMs). QLoRA, LLama, bloom, baichuan-7B, GLM [Code]

  • bitsandbytes: 8-bit CUDA functions for PyTorch [Code]

  • ExLlama: A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights. [Code]

  • lit-gpt: Hackable implementation of state-of-the-art open-source LLMs based on nanoGPT. Supports flash attention, 4-bit and 8-bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. [Code]

  • Lit-LLaMA: Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. [Code]

  • llama.onnx: LLaMa/RWKV ONNX models, quantization and test cases [Code]

  • fastLLaMa: An experimental high-performance framework for running Decoder-only LLMs with 4-bit quantization in Python using a C/C++ backend. [Code]

  • Sparsebit: A model compression and acceleration toolbox based on PyTorch. [Code]

  • llama2.c: Inference Llama 2 in one file of pure C [Code]

  • Megatron-LM: Ongoing research training transformer models at scale [Code]

  • ggml: Tensor library for machine learning [Code]

  • LLamaSharp: C#/.NET binding of llama.cpp, including LLaMa/GPT model inference and quantization, ASP.NET core integration and UI [Code]

  • rwkv.cpp: INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model [Code]

  • Can my GPU run this LLM?: Calculate GPU memory requirement & breakdown for training/inference of LLM models. Supports ggml/bnb quantization [Code]

  • TinyChatEngine: On-Device LLM Inference Library [Code]

  • TensorRT-LLM: TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. [Code]

  • IntLLaMA: A fast and light quantization solution for LLaMA [Code]

  • EasyLLM: Built upon Megatron-Deepspeed and HuggingFace Trainer, EasyLLM has reorganized the code logic with a focus on usability. While enhancing usability, it also ensures training efficiency [Code]

  • GreenBit LLaMA: Advanced Ultra-Low Bitrate Compression Techniques for the LLaMA Family of LLMs [Code]

Contributing

This is an active repository and your contributions are always welcome! Before you add papers or tools to the list, please make sure that:

  • The paper or tool is related to Large Language Models (LLMs). Compression algorithms or tools that are only evaluated on small-scale language models (e.g., BERT) should not be included in the list.
  • The paper is inserted at the correct position in chronological order (publication or arXiv release date).
  • For a paper posted on arXiv, the [Paper] link points to the arXiv page, not the PDF.
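
As an illustration of the last rule, the sketch below rewrites an arXiv PDF link into the corresponding page link. It assumes that "the arxiv page" means the abstract URL (`arxiv.org/abs/<id>`) rather than the PDF URL (`arxiv.org/pdf/<id>`). The helper name `to_abs_url` and the paper ID used in the example are hypothetical and not part of this repository.

```python
import re

# Hypothetical helper, not part of this repository.
# Matches arXiv PDF links such as https://arxiv.org/pdf/<id> or .../<id>.pdf
_ARXIV_PDF = re.compile(r"^(https?://arxiv\.org)/pdf/(.+?)(?:\.pdf)?$")


def to_abs_url(url: str) -> str:
    """Return the arXiv abstract URL for a PDF URL; leave other URLs unchanged."""
    url = url.strip()
    match = _ARXIV_PDF.match(url)
    if match is None:
        # Not an arXiv PDF link (already an abstract page, or another site).
        return url
    base, paper_id = match.groups()
    return f"{base}/abs/{paper_id}"


if __name__ == "__main__":
    # "2301.00001" is a placeholder ID chosen only for illustration.
    print(to_abs_url("https://arxiv.org/pdf/2301.00001.pdf"))
    print(to_abs_url("https://arxiv.org/abs/2301.00001"))
```

Links that are already abstract pages, or that point to other sites, are returned unchanged, so the helper can be run over every [Paper] link before opening a pull request.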
