# LLM Acceleration Paper List

Below is a list of papers categorized by technique. A list categorized by conference will be added in the future.

| Methodology | Papers | Publication Venue/Affiliations | Materials |
| --- | --- | --- | --- |
| Quantization | SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | arXiv 2023 | code |
| | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | arXiv 2023 | code |
| | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | ICLR 2023 | code |
| | OWQ: Lessons Learned from Activation Outliers for Weight Quantization in Large Language Models | arXiv 2023 | code |
| | LUT-GEMM: Quantized Matrix Multiplication Based on LUTs for Efficient Inference in Large-Scale Generative Language Models | arXiv 2022 | |
| | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | NeurIPS 2022 | code |
| | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | ICML 2023 | code |
| | RPTQ: Reorder-based Post-training Quantization for Large Language Models | arXiv 2023 | code |
| | Outlier Suppression+: Accurate Quantization of Large Language Models by Equivalent and Optimal Shifting and Scaling | arXiv 2023 | |
| Sparsity | Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning | arXiv 2023 | code |
| | A Simple and Effective Pruning Approach for Large Language Models | arXiv 2023 | code |
| | LLM-Pruner: On the Structural Pruning of Large Language Models | arXiv 2023 | code |
| | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML 2023 | code |
| | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | ICML 2023 | code |
| Attention Pattern | Efficient Streaming Language Models with Attention Sinks | arXiv 2023 | code |
| | LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models | arXiv 2023 | code |
| | Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences | arXiv 2023 | |
| Architecture-level | Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | VLDB 2024 | code |
| | vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | arXiv 2023 | code |
| | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | NeurIPS 2022 | code |
| System-level | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | ICML 2023 | code |