# LLM Acceleration Paper List

Below is a list of papers categorized by technique. A list categorized by conference will be added in the future.

| Methodology | Papers | Publication Venue/Affiliations | Materials |
| --- | --- | --- | --- |
| Quantization | SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | arXiv 2023 | code |
| | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | arXiv 2023 | code |
| | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | ICLR 2023 | code |
| | OWQ: Lessons Learned from Activation Outliers for Weight Quantization in Large Language Models | arXiv 2023 | code |
| | LUT-GEMM: Quantized Matrix Multiplication Based on LUTs for Efficient Inference in Large-Scale Generative Language Models | arXiv 2022 | |
| | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | NeurIPS 2022 | code |
| | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | ICML 2023 | code |
| | RPTQ: Reorder-based Post-training Quantization for Large Language Models | arXiv 2023 | code |
| | Outlier Suppression+: Accurate Quantization of Large Language Models by Equivalent and Optimal Shifting and Scaling | arXiv 2023 | |
| Sparsity | Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning | arXiv 2023 | code |
| | A Simple and Effective Pruning Approach for Large Language Models | arXiv 2023 | code |
| | LLM-Pruner: On the Structural Pruning of Large Language Models | arXiv 2023 | code |
| | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML 2023 | code |
| | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | ICML 2023 | code |
| Attention Pattern | Efficient Streaming Language Models with Attention Sinks | arXiv 2023 | code |
| | LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models | arXiv 2023 | code |
| | Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences | arXiv 2023 | |
| Architecture-level | Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | VLDB 2024 | code |
| | vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | arXiv 2023 | code |
| | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | NeurIPS 2022 | code |
| System-level | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | ICML 2023 | code |