
# 📕 Large Language Models Acceleration Paper List


The papers below are categorized by methodology.

A list categorized by publication venue will be added in the future.

| Methodology | Papers | Publication Venue/Affiliations | Materials |
| --- | --- | --- | --- |
| Quantization | SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | Arxiv 2023 | code |
| | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | Arxiv 2023 | code |
| | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | ICLR 2023 | code |
| | OWQ: Lessons Learned from Activation Outliers for Weight Quantization in Large Language Models | Arxiv 2023 | code |
| | LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models | Arxiv 2022 | |
| | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | NeurIPS 2022 | code |
| | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | ICML 2023 | code |
| | RPTQ: Reorder-based Post-training Quantization for Large Language Models | Arxiv 2023 | code |
| | Outlier Suppression+: Accurate Quantization of Large Language Models by Equivalent and Optimal Shifting and Scaling | Arxiv 2023 | |
| Sparsity | Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning | Arxiv 2023 | code |
| | A Simple and Effective Pruning Approach for Large Language Models | Arxiv 2023 | code |
| | LLM-Pruner: On the Structural Pruning of Large Language Models | Arxiv 2023 | code |
| | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML 2023 | code |
| | SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot | ICML 2023 | code |
| Attention Pattern | Efficient Streaming Language Models with Attention Sinks | Arxiv 2023 | code |
| | LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models | Arxiv 2023 | code |
| | Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences | Arxiv 2023 | |
| Architecture-level | Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | VLDB 2024 | code |
| | vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | Arxiv 2023 | code |
| | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | NeurIPS 2022 | code |
| System-level | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | ICML 2023 | code |
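
For readers new to the area, the sketch below illustrates the simplest weight-only baseline that the quantization papers above build on: symmetric, per-channel round-to-nearest INT8 quantization. It is a minimal, generic illustration only; the function names and shapes are made up for this example and are not taken from any of the listed codebases, each of which refines this baseline (error compensation in GPTQ, activation-aware scaling in AWQ, outlier handling in SpQR/OWQ, and so on).

```python
import numpy as np

def quantize_per_channel_int8(w: np.ndarray):
    """Symmetric round-to-nearest INT8 quantization with one scale per output channel.

    A generic baseline sketch, not the method of any specific paper above.
    w: [out_channels, in_channels] weight matrix.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # per-row (per-output-channel) scale
    scale = np.where(scale == 0, 1.0, scale)              # avoid division by zero for all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Map INT8 weights back to float for comparison against the original."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 8)).astype(np.float32)
    q, s = quantize_per_channel_int8(w)
    print("max abs reconstruction error:", np.abs(w - dequantize(q, s)).max())
```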