Awesome-LLM-System-Papers

This is a (non-comprehensive) list of LLM system papers maintained by the ALCHEM Lab. Feel free to open a pull request or an issue if we have missed any interesting papers!

Algorithm-System Co-Design

  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (JMLR'22) link to paper
  • Scalable and Efficient MoE Training for Multitask Multilingual Models (arXiv'21) link to paper
  • DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale (ICML'22) link to paper
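
Many of the MoE systems in this section revolve around top-1 ("switch") routing with a per-expert capacity limit. A minimal pure-Python sketch of that routing step, using toy gate scores and no real model (all names and numbers here are illustrative, not from any of the papers):

```python
# Toy sketch of Switch-Transformer-style top-1 expert routing: each token is
# sent to the single expert with the highest gate score, subject to a
# per-expert capacity limit; overflow tokens are dropped, mirroring the
# capacity-factor mechanism described in the Switch Transformers paper.

def route_top1(gate_scores, num_experts, capacity):
    """gate_scores: one list of per-expert scores per token.
    Returns (assignments, dropped), where assignments[e] lists token ids."""
    assignments = {e: [] for e in range(num_experts)}
    dropped = []
    for tok, scores in enumerate(gate_scores):
        expert = max(range(num_experts), key=lambda e: scores[e])
        if len(assignments[expert]) < capacity:
            assignments[expert].append(tok)
        else:
            dropped.append(tok)  # expert over capacity: token is dropped
    return assignments, dropped

if __name__ == "__main__":
    scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8]]
    assignments, dropped = route_top1(scores, num_experts=2, capacity=2)
    print(assignments)  # {0: [0, 1], 1: [3]}
    print(dropped)      # [2]
```

Real systems make this decision per layer on-device and combine it with load-balancing losses and all-to-all communication; this sketch only shows the assignment logic.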

LLM Inference (Serving) Systems

  • Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI'22) link to paper
  • TurboTransformers: An Efficient GPU Serving System For Transformer Models (PPoPP'21) link to paper
  • PetS: A Unified Framework for Parameter-Efficient Transformers Serving (ATC'22) link to paper
  • DeepSpeed-inference: enabling efficient inference of transformer models at unprecedented scale (SC'22) link to paper
  • EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models (arXiv'22) link to paper
  • PETALS: Collaborative Inference and Fine-tuning of Large Models (NeurIPS'22 Workshop WBRC) link to paper
  • SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification (preprint'23) link to paper
  • Fast Distributed Inference Serving for Large Language Models (arXiv'23) link to paper
  • FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (ICML'23) link to paper
  • PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (preprint'23) link to paper
  • LLM in a flash: Efficient Large Language Model Inference with Limited Memory (arXiv'23) link to paper
  • An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs (arXiv'23) link to paper
  • Accelerating LLM Inference with Staged Speculative Decoding (arXiv'23) link to paper
  • Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP'23) link to paper
  • EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models (arXiv'23) link to paper
  • Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding (arXiv'23) link to paper
  • S3: Increasing GPU Utilization during Generative Inference for Higher Throughput (arXiv'23) link to paper
  • Punica: Multi-Tenant LoRA Serving (arXiv'23) link to paper
  • S-LoRA: Serving Thousands of Concurrent LoRA Adapters (arXiv'23) link to paper
  • Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time (ICML'23) link to paper
  • Splitwise: Efficient Generative LLM Inference Using Phase Splitting (arXiv'23, update: ISCA'24) link to paper
  • SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills (arXiv'23) link to paper
  • SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads (arXiv'23) link to paper
  • Efficiently Programming Large Language Models using SGLang (arXiv'23) link to paper
  • SpotServe: Serving Generative Large Language Models on Preemptible Instances (ASPLOS'24) link to paper
  • Multi-Candidate Speculative Decoding (arXiv'24) link to paper
  • Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache (arXiv'24) link to paper
  • Break the Sequential Dependency of LLM Inference Using Lookahead Decoding (arXiv'24) link to paper
  • FASTDECODE: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines (arXiv'24) link to paper
  • FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning (arXiv'24) link to paper
  • Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads (arXiv'24) link to paper
  • MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving (arXiv'24) link to paper
  • DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (arXiv'24) link to paper
  • DeFT: Flash Tree-attention with IO-Awareness for Efficient Tree-search-based LLM Inference (ICLR'24) link to paper
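
Several entries above (SpecInfer, Staged Speculative Decoding, Draft & Verify, Multi-Candidate Speculative Decoding) share a draft-then-verify loop. A minimal greedy-acceptance sketch in pure Python, with stand-in callables instead of real draft/target models (everything here is a toy assumption, not any paper's actual algorithm):

```python
# Toy sketch of the speculative-decoding acceptance loop (greedy variant):
# a cheap "draft" model proposes k tokens, the expensive "target" model
# re-scores them in one pass, and the longest prefix matching the target's
# own greedy choices is accepted, plus one corrected token on disagreement.

def speculative_step(draft_next, target_next, prefix, k):
    """Returns the tokens appended in one speculative step."""
    proposal, ctx = [], list(prefix)
    for _ in range(k):                    # draft model runs k cheap steps
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(prefix)
    for tok in proposal:                  # target verifies all k at once
        t = target_next(ctx)              # target's greedy token here
        if t == tok:
            accepted.append(tok)          # draft agreed: accepted "for free"
            ctx.append(tok)
        else:
            accepted.append(t)            # first disagreement: take target's
            return accepted
    return accepted

if __name__ == "__main__":
    draft = lambda ctx: len(ctx) % 3                           # toy draft model
    target = lambda ctx: len(ctx) % 3 if len(ctx) < 4 else 9   # diverges later
    print(speculative_step(draft, target, [0, 1], k=4))        # [2, 0, 9]
```

In a real serving system the verification pass is a single batched forward through the target model, which is what makes accepted draft tokens nearly free.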

On-device LLM Inference (Serving) Systems

  • PowerInfer-2: Fast Large Language Model Inference on a Smartphone (arXiv'24) link to paper
  • Empowering 1000 tokens/second on-device LLM prefilling with mllm-NPU (arXiv'24) link to paper

Profiling and Benchmark Systems

  • MELTing point: Mobile Evaluation of Language Transformers (MobiCom'24) link to paper
  • MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases (NeurIPS'24) link to paper

LLM Training Systems

Single-GPU Systems

  • Cramming: Training a Language Model on a Single GPU in One Day (arXiv'22) link to paper
  • Easy and Efficient Transformer: Scalable Inference Solution For Large NLP Model (arXiv'22) link to paper
  • High-throughput Generative Inference of Large Language Models with a Single GPU (arXiv'23) link to paper
  • ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs (arXiv'23) link to paper

Distributed Systems

  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (SC'20) link to paper
  • Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (arXiv'20) link to paper
  • PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models (ICML'21) link to paper
  • Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (SC'21) link to paper
  • TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models (ICML'21) link to paper
  • FastMoE: A Fast Mixture-of-Expert Training System (arXiv'21) link to paper
  • Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model (arXiv'22) link to paper
  • Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (OSDI'22) link to paper
  • LightSeq2: Accelerated Training for Transformer-Based Models on GPUs (SC'22) link to paper
  • Pathways: Asynchronous Distributed Dataflow for ML (arXiv'22) link to paper
  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (NeurIPS'22) link to paper
  • Varuna: Scalable, Low-cost Training of Massive Deep Learning Models (EuroSys'22) link to paper
  • FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models (PPoPP'22) link to paper
  • PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing (arXiv'23) link to paper
  • Mobius: Fine Tuning Large-Scale Models on Commodity GPU Servers (ASPLOS'23) link to paper
  • Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression (ASPLOS'23) link to paper
  • ZeRO++: Extremely Efficient Collective Communication for Giant Model Training (arXiv'23) link to paper
  • A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training (ICS'23) link to paper
  • BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models (ICML'23) link to paper
  • Optimized Network Architectures for Large Language Model Training with Billions of Parameters (arXiv'23) link to paper
  • SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient (arXiv'23) link to paper
  • Blockwise Parallel Transformer for Large Context Models (NeurIPS'23) link to paper
  • Ring Attention with Blockwise Transformers for Near-Infinite Context (arXiv'23) link to paper
  • DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models (arXiv'23) link to paper
  • Effective Long-Context Scaling of Foundation Models (arXiv'23) link to paper
  • GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length (arXiv'23) link to paper
  • LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers (arXiv'23) link to paper
  • Efficient Streaming Language Models with Attention Sinks (arXiv'23) link to paper
  • PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management (TPDS'23) link to paper
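
ZeRO (and the sharded-state ideas several papers above build on) partitions optimizer state across data-parallel ranks so each rank stores and updates only 1/N of it. A toy single-process simulation of a ZeRO stage-1-style momentum-SGD step (illustrative only; no real gradients, collectives, or framework APIs):

```python
# Toy simulation of ZeRO stage-1-style optimizer-state sharding: every rank
# holds full parameters and gradients, but only a 1/world_size shard of the
# momentum buffer. Each rank updates its own shard, and the updated shards
# are conceptually all-gathered back into the full parameter vector.

def zero1_step(params, grads, momentum_shards, world_size, lr=0.1, beta=0.9):
    """momentum_shards[rank] holds momentum only for that rank's shard;
    this sharding is where ZeRO stage 1 saves memory."""
    n = len(params)
    shard = (n + world_size - 1) // world_size
    new_params = list(params)
    for rank in range(world_size):
        lo, hi = rank * shard, min((rank + 1) * shard, n)
        buf = momentum_shards[rank]
        for j, i in enumerate(range(lo, hi)):
            buf[j] = beta * buf[j] + grads[i]        # momentum for owned shard
            new_params[i] = params[i] - lr * buf[j]  # SGD-with-momentum step
    return new_params  # stands in for the all-gather of per-rank shards

if __name__ == "__main__":
    params, grads = [1.0] * 4, [0.5] * 4
    momentum = [[0.0, 0.0], [0.0, 0.0]]  # 2 ranks, each owning 2 momentum slots
    print(zero1_step(params, grads, momentum, world_size=2))
```

Stages 2 and 3 of ZeRO extend the same idea to gradients and parameters themselves, trading extra communication for further memory savings.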

General MLSys-Related Techniques (Incomplete)

  • Efficient GPU Spatial-Temporal Multitasking (TPDS'14) link to paper
  • Enabling preemptive multiprogramming on GPUs (ISCA'14) link to paper
  • Chimera: Collaborative Preemption for Multitasking on a Shared GPU (ASPLOS'15) link to paper
  • Simultaneous Multikernel GPU: Multi-tasking Throughput Processors via Fine-Grained Sharing (HPCA'16) link to paper
  • FLEP: Enabling Flexible and Efficient Preemption on GPUs (ASPLOS'17) link to paper
  • Dynamic Resource Management for Efficient Utilization of Multitasking GPUs (ASPLOS'17) link to paper
  • Mesh-TensorFlow: Deep Learning for Supercomputers (NeurIPS'18) link to paper
  • PipeDream: Fast and Efficient Pipeline Parallel DNN Training (SOSP'19) link to paper
  • GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism (NeurIPS'19) link to paper
  • PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications (OSDI'20) link to paper
  • Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences (OSDI'22) link to paper
  • Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models (ASPLOS'23) link to paper
  • AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (OSDI'23) link to paper
  • Benchmarking and Dissecting the Nvidia Hopper GPU Architecture (IPDPS'24) link to paper
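
GPipe and PipeDream above pipeline micro-batches across model stages: with S stages and M micro-batches, the forward pass completes in S + M - 1 time steps rather than S x M. A toy schedule generator showing which (stage, micro-batch) pairs run concurrently at each step (pure Python, illustrative only):

```python
# Toy GPipe-style forward schedule: stage s processes micro-batch m at time
# step s + m, so stages overlap on different micro-batches and the pipeline
# "bubble" shrinks as the number of micro-batches grows.

def gpipe_forward_schedule(num_stages, num_microbatches):
    """Return one list per time step of concurrently running
    (stage, microbatch) pairs."""
    total = num_stages + num_microbatches - 1
    steps = []
    for t in range(total):
        busy = [(s, t - s) for s in range(num_stages)
                if 0 <= t - s < num_microbatches]
        steps.append(busy)
    return steps

if __name__ == "__main__":
    for t, busy in enumerate(gpipe_forward_schedule(3, 4)):
        print(t, busy)   # steady state keeps all 3 stages busy
```

Schedules like PipeDream's 1F1B interleave backward passes into the same grid to bound activation memory; this sketch shows only the forward fill/drain pattern.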

LLM Algorithm Papers Recommended for System Researchers

  • Attention Is All You Need (NeurIPS'17) link to paper
  • Language Models are Unsupervised Multitask Learners (preprint from OpenAI) link to paper
  • Improving Language Understanding by Generative Pre-Training (preprint from OpenAI) link to paper
  • Language Models are Few-Shot Learners (NeurIPS'20) link to paper
  • GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (ICLR'21) link to paper
  • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (JMLR'20) link to paper
  • Multitask Prompted Training Enables Zero-Shot Task Generalization (ICLR'22) link to paper
  • Finetuned Language Models are Zero-Shot Learners (ICLR'22) link to paper
  • GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (ICML'22) link to paper
  • Training language models to follow instructions with human feedback (NeurIPS'22) link to paper
  • LaMDA: Language Models for Dialog Applications (arXiv'22) link to paper
  • PaLM: Scaling Language Modeling with Pathways (arXiv'22) link to paper
  • LoRA: Low-Rank Adaptation of Large Language Models (ICLR'22) link to paper
  • OPT: Open Pre-trained Transformer Language Models (arXiv'22) link to paper
  • Holistic Evaluation of Language Models (arXiv'22) link to paper
  • BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (arXiv'23) link to paper
  • LLaMA: Open and Efficient Foundation Language Models (arXiv'23) link to paper
  • Training Compute-Optimal Large Language Models (preprint from DeepMind) link to paper
  • Scaling Laws for Neural Language Models (preprint) link to paper
  • Scaling Language Models: Methods, Analysis & Insights from Training Gopher (preprint from DeepMind) link to paper
  • LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models (arXiv'23) link to paper
  • RWKV: Reinventing RNNs for the Transformer Era (arXiv'23) link to paper
  • LongNet: Scaling Transformers to 1,000,000,000 Tokens (arXiv'23) link to paper
  • SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference (arXiv'23) link to paper
  • FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (arXiv'23) link to paper
  • Retentive Network: A Successor to Transformer for Large Language Models (arXiv'23) link to paper
  • TransNormer: Scaling TransNormer to 175 Billion Parameters (arXiv'23) link to paper
  • Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding (arXiv'23) link to paper
  • From Sparse to Soft Mixtures of Experts (arXiv'23) link to paper
  • One Wide Feedforward is All You Need (arXiv'23) link to paper
  • Gated recurrent neural networks discover attention (arXiv'23) link to paper
  • Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning (arXiv'23) link to paper
  • Scaling Laws for Sparsely-Connected Foundation Models (arXiv'23) link to paper
  • Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large Language Models for Dynamic Inference Using Sorted Fine-Tuning (SoFT) (arXiv'23) link to paper
  • LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models (arXiv'23) link to paper
  • PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training (arXiv'23) link to paper
  • Retrieval meets Long Context Large Language Models (arXiv'23) link to paper
  • HyperAttention: Long-context Attention in Near-Linear Time (arXiv'23) link to paper
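
For system readers new to the architecture papers above, the operation almost every entry in this list optimizes is scaled dot-product attention, softmax(QK^T / sqrt(d)) V. A dependency-free toy implementation (single head, matrices as lists of lists; purely illustrative):

```python
# Toy scaled dot-product attention, the core of the Transformer:
# out = softmax(Q K^T / sqrt(d)) V, computed one query row at a time.

import math

def attention(Q, K, V):
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        m = max(scores)                          # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]          # attention weights sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

if __name__ == "__main__":
    Q = [[1.0, 0.0]]
    K = [[1.0, 0.0], [0.0, 1.0]]
    V = [[1.0, 0.0], [0.0, 1.0]]
    print(attention(Q, K, V))  # the query attends more to the first key
```

The quadratic cost in sequence length of the score matrix is exactly what FlashAttention, Ring Attention, and the long-context papers above attack.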

Surveys

  • A Survey of Large Language Models (arXiv'23) link to paper
  • Challenges and Applications of Large Language Models (arXiv'23) link to paper
  • FLM-101B: An Open LLM and How to Train It with $100K Budget (arXiv'23) link to paper
  • Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems (arXiv'23) link to paper

Awesome Open-Sourced LLMSys Projects

Other Useful Resources