Awesome-LLM-System-Papers

This is a (non-comprehensive) list of LLM system papers maintained by ALCHEM Lab. Feel free to open a pull request or an issue if we have missed any interesting papers!

Algorithm-System Co-Design

  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (JMLR'21) link to paper
  • Scalable and Efficient MoE Training for Multitask Multilingual Models (arXiv'21) link to paper
  • DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale (ICML'22) link to paper
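
For readers new to sparsely activated models, the snippet below is a minimal PyTorch sketch of the top-1 ("switch") routing these papers build on: a small gating network picks one expert FFN per token and scales its output by the gate probability. It is illustrative only; the names (SwitchMoE, d_ff, num_experts) are ours, and capacity limits and load-balancing losses are omitted.

```python
# Toy top-1 ("switch") routing: a gating network sends each token to exactly one expert FFN.
# Class and parameter names (SwitchMoE, num_experts, d_ff) are ours, not from any paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); each token is routed to a single expert.
        gate_probs = F.softmax(self.router(x), dim=-1)   # (num_tokens, num_experts)
        gate_val, expert_idx = gate_probs.max(dim=-1)    # top-1 routing decision per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                # Scale by the gate probability so the routing decision stays differentiable.
                out[mask] = gate_val[mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 64)
moe = SwitchMoE(d_model=64, d_ff=256, num_experts=4)
print(moe(tokens).shape)  # torch.Size([16, 64])
```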

LLM Inference (Serving) Systems

Single-GPU Systems

  • TurboTransformers: An Efficient GPU Serving System For Transformer Models (PPoPP'21) link to paper
  • PetS: A Unified Framework for Parameter-Efficient Transformers Serving (ATC'22) link to paper
  • Easy and Efficient Transformer: Scalable Inference Solution for Large NLP Model (arXiv'22) link to paper
  • High-throughput Generative Inference of Large Language Models with a Single GPU (arXiv'23) link to paper

Distributed Systems

  • Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI'22) link to paper
  • DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale (SC'22) link to paper
  • EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models (arXiv'22) link to paper
  • PETALS: Collaborative Inference and Fine-tuning of Large Models (NeurIPS'22 Workshop WBRC) link to paper
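
As background for the serving systems above, the toy sketch below captures iteration-level (continuous) batching, the scheduling idea introduced by Orca: the running batch is reassembled at every decoding step, so finished requests leave immediately and queued requests join without waiting for the whole batch to drain. It is a framework-free sketch; the names (Request, decode_step, max_batch_size) are illustrative and not Orca's actual API.

```python
# Toy sketch of iteration-level (continuous) batching: the batch is rebuilt every decoding
# step, so finished requests retire immediately and queued requests join mid-flight.
# All names (Request, decode_step, max_batch_size) are illustrative.
import random
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: int
    max_new_tokens: int
    generated: list = field(default_factory=list)

def decode_step(batch):
    """Stand-in for one forward pass that emits one token per in-flight request."""
    for req in batch:
        req.generated.append(random.randint(0, 50_000))

def serve(queue, max_batch_size=4):
    running = []
    while queue or running:
        # Admit waiting requests up to capacity -- per iteration, not per batch.
        while queue and len(running) < max_batch_size:
            running.append(queue.pop(0))
        decode_step(running)
        # Retire requests that hit their stopping condition this iteration.
        finished = [r for r in running if len(r.generated) >= r.max_new_tokens]
        running = [r for r in running if r not in finished]
        for r in finished:
            print(f"request {r.rid} done after {len(r.generated)} tokens")

serve([Request(rid=i, max_new_tokens=random.randint(2, 6)) for i in range(8)])
```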

LLM Training Systems

Single-GPU Systems

  • Cramming: Training a Language Model on a Single GPU in One Day (arXiv'22) link to paper

Distributed Systems

  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (SC'20) link to paper
  • Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (arXiv'20) link to paper
  • PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models (ICML'21) link to paper
  • Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (SC'21) link to paper
  • TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models (ICML'21) link to paper
  • Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model (arXiv'22) link to paper
  • Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (OSDI'22) link to paper
  • LightSeq2: Accelerated Training for Transformer-Based Models on GPUs (SC'22) link to paper
  • PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing (arXiv'23) link to paper
  • Mobius: Fine Tuning Large-Scale Models on Commodity GPU Servers (ASPLOS'23) link to paper
  • Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression (ASPLOS'23) link to paper
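
Many of the papers above build on tensor (intra-layer) model parallelism in the Megatron-LM style. The snippet below is a single-process simulation of a column-parallel linear layer, under the assumption that simulated "ranks" suffice for illustration: the weight is split by columns, each shard produces a partial output independently, and the concatenation stands in for the all-gather a real multi-GPU run would perform.

```python
# Single-process sketch of Megatron-LM-style column-parallel tensor parallelism for one
# linear layer. Communication is simulated; names like tp_ranks are illustrative.
import torch

torch.manual_seed(0)
d_in, d_out, tp_ranks = 8, 12, 4

x = torch.randn(2, d_in)      # a micro-batch of activations
W = torch.randn(d_in, d_out)  # the full weight, kept only for the correctness check

# Column-parallel split: each "rank" owns d_out / tp_ranks output columns.
shards = torch.chunk(W, tp_ranks, dim=1)
partial_outputs = [x @ W_shard for W_shard in shards]  # computed independently per rank
y_parallel = torch.cat(partial_outputs, dim=1)         # all-gather along the hidden dim

assert torch.allclose(y_parallel, x @ W, atol=1e-5)
print("column-parallel result matches the unsharded matmul")
```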

General MLSys-Related Techniques (Not Complete)

  • Efficient GPU Spatial-Temporal Multitasking (TPDS'14) link to paper
  • Enabling Preemptive Multiprogramming on GPUs (ISCA'14) link to paper
  • Chimera: Collaborative Preemption for Multitasking on a Shared GPU (ASPLOS'15) link to paper
  • Simultaneous Multikernel GPU: Multi-tasking Throughput Processors via Fine-Grained Sharing (HPCA'16) link to paper
  • FLEP: Enabling Flexible and Efficient Preemption on GPUs (ASPLOS'17) link to paper
  • Dynamic Resource Management for Efficient Utilization of Multitasking GPUs (ASPLOS'17) link to paper
  • Mesh-TensorFlow: Deep Learning for Supercomputers (NeurIPS'18) link to paper
  • PipeDream: Generalized Pipeline Parallelism for DNN Training (SOSP'19) link to paper
  • GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism (NeurIPS'19) link to paper
  • PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications (OSDI'20) link to paper
  • Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences (OSDI'22) link to paper
  • Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models (ASPLOS'23) link to paper
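
Because several of these papers revolve around pipeline parallelism, the short sketch below prints a GPipe-style forward schedule: the mini-batch is split into micro-batches that flow through the stages in a staggered fashion so stages overlap work instead of idling. It only simulates the schedule (no model, no backward pass), and the variable names are ours.

```python
# Print which micro-batch each pipeline stage works on at each time step in a GPipe-style
# forward schedule. Stage s handles micro-batch (t - s) at time t once it has arrived.
num_stages, num_microbatches = 4, 6

for t in range(num_stages + num_microbatches - 1):
    busy = [
        f"stage {s}: micro-batch {t - s}"
        for s in range(num_stages)
        if 0 <= t - s < num_microbatches
    ]
    print(f"t={t}: " + "; ".join(busy))
```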

LLM Algorithm Papers Recommended for System Researchers

  • Attention Is All You Need (NeurIPS'17) link to paper
  • Language Models are Unsupervised Multitask Learners (preprint from OpenAI) link to paper
  • Improving Language Understanding by Generative Pre-Training (preprint from OpenAI) link to paper
  • Language Models are Few-Shot Learners (NeurIPS'20) link to paper
  • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (JMLR'20) link to paper
  • Multitask Prompted Training Enables Zero-Shot Task Generalization (ICLR'22) link to paper
  • Finetuned Language Models are Zero-Shot Learners (ICLR'22) link to paper
  • GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (ICML'22) link to paper
  • LaMDA: Language Models for Dialog Applications (arXiv'22) link to paper
  • PaLM: Scaling Language Modeling with Pathways (arXiv'22) link to paper
  • OPT: Open Pre-trained Transformer Language Models (arXiv'22) link to paper
  • Holistic Evaluation of Language Models (arXiv'22) link to paper
  • BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (arXiv'23) link to paper
  • LLaMA: Open and Efficient Foundation Language Models (arXiv'23) link to paper
  • Training Compute-Optimal Large Language Models (preprint from DeepMind) link to paper
  • Scaling Laws for Neural Language Models (preprint) link to paper
  • Scaling Language Models: Methods, Analysis & Insights from Training Gopher (preprint from DeepMind) link to paper
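
For system researchers who want a concrete handle on the architecture behind all of these models, here is a minimal PyTorch sketch of the scaled dot-product attention from "Attention Is All You Need", softmax(QK^T / sqrt(d_k)) V, with an optional causal mask as used in decoder-only LMs. The function name and tensor shapes are illustrative, and multi-head projections are omitted.

```python
# Minimal scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V, with an optional
# causal mask that blocks attention to future tokens (decoder-only LM setting).
import math
import torch

def scaled_dot_product_attention(q, k, v, causal=False):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # (..., seq_q, seq_k)
    if causal:
        seq_q, seq_k = scores.shape[-2:]
        mask = torch.triu(torch.ones(seq_q, seq_k, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))  # hide future positions
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 5, 16)   # (batch, seq, head_dim)
out = scaled_dot_product_attention(q, k, v, causal=True)
print(out.shape)                    # torch.Size([1, 5, 16])
```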

Other Useful Resources