Awesome-LLM-System-Papers

This is a (non-comprehensive) list of LLM system papers maintained by the ALCHEM Lab. Feel free to open a pull request or an issue if we have missed any interesting papers!

Algorithm-System Co-Design

  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (JMLR'22) link to paper
  • Scalable and Efficient MoE Training for Multitask Multilingual Models (arXiv'21) link to paper
  • DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale (ICML'22) link to paper
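
Many of the MoE systems in this section revolve around top-1 ("switch") routing with a per-expert capacity limit. A minimal pure-Python sketch of that routing step, using toy gate scores and no real model (all names and numbers here are illustrative, not from any of the papers):

```python
# Toy sketch of Switch-Transformer-style top-1 expert routing: each token is
# sent to the single expert with the highest gate score, subject to a
# per-expert capacity limit; overflow tokens are dropped, mirroring the
# capacity-factor mechanism described in the Switch Transformers paper.

def route_top1(gate_scores, num_experts, capacity):
    """gate_scores: one list of per-expert scores per token.
    Returns (assignments, dropped), where assignments[e] lists token ids."""
    assignments = {e: [] for e in range(num_experts)}
    dropped = []
    for tok, scores in enumerate(gate_scores):
        expert = max(range(num_experts), key=lambda e: scores[e])
        if len(assignments[expert]) < capacity:
            assignments[expert].append(tok)
        else:
            dropped.append(tok)  # expert over capacity: token is dropped
    return assignments, dropped

if __name__ == "__main__":
    scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8]]
    assignments, dropped = route_top1(scores, num_experts=2, capacity=2)
    print(assignments)  # {0: [0, 1], 1: [3]}
    print(dropped)      # [2]
```

Real systems make this decision per layer on-device and combine it with load-balancing losses and all-to-all communication; this sketch only shows the assignment logic.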

LLM Inference (Serving) Systems

  • Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI'22) link to paper
  • TurboTransformers: An Efficient GPU Serving System For Transformer Models (PPoPP'21) link to paper
  • PetS: A Unified Framework for Parameter-Efficient Transformers Serving (ATC'22) link to paper
  • DeepSpeed-inference: enabling efficient inference of transformer models at unprecedented scale (SC'22) link to paper
  • EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models (arXiv'22) link to paper
  • PETALS: Collaborative Inference and Fine-tuning of Large Models (NeurIPS'22 Workshop WBRC) link to paper
  • SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification (preprint'23) link to paper
  • Fast Distributed Inference Serving for Large Language Models (arXiv'23) link to paper
  • FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (ICML'23) link to paper
  • PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (preprint'23) link to paper
  • LLM in a flash: Efficient Large Language Model Inference with Limited Memory (arXiv'23) link to paper
  • An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs (arXiv'23) link to paper
  • Accelerating LLM Inference with Staged Speculative Decoding (arXiv'23) link to paper
  • Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP'23) link to paper
  • EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models (arXiv'23) link to paper
  • Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding (arXiv'23) link to paper
  • S3: Increasing GPU Utilization during Generative Inference for Higher Throughput (arXiv'23) link to paper
  • Punica: Multi-Tenant LoRA Serving (arXiv'23) link to paper
  • S-LoRA: Serving Thousands of Concurrent LoRA Adapters (arXiv'23) link to paper
  • Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time (ICML'23) link to paper
  • Splitwise: Efficient Generative LLM Inference Using Phase Splitting (arXiv'23, update: ISCA'24) link to paper
  • SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills (arXiv'23) link to paper
  • SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads (arXiv'23) link to paper
  • Efficiently Programming Large Language Models using SGLang (arXiv'23) link to paper
  • SpotServe: Serving Generative Large Language Models on Preemptible Instances (ASPLOS'24) link to paper
  • Multi-Candidate Speculative Decoding (arXiv'24) link to paper
  • Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache (arXiv'24) link to paper
  • Break the Sequential Dependency of LLM Inference Using Lookahead Decoding (arXiv'24) link to paper
  • FASTDECODE: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines (arXiv'24) link to paper
  • FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning (arXiv'24) link to paper
  • Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads (arXiv'24) link to paper
  • MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving (arXiv'24) link to paper
  • DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (arXiv'24) link to paper
  • DeFT: Flash Tree-attention with IO-Awareness for Efficient Tree-search-based LLM Inference (ICLR'24) link to paper
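
Several entries above (SpecInfer, Staged Speculative Decoding, Draft & Verify, Multi-Candidate Speculative Decoding) share a draft-then-verify loop. A minimal greedy-acceptance sketch in pure Python, with stand-in callables instead of real draft/target models (everything here is a toy assumption, not any paper's actual algorithm):

```python
# Toy sketch of the speculative-decoding acceptance loop (greedy variant):
# a cheap "draft" model proposes k tokens, the expensive "target" model
# re-scores them in one pass, and the longest prefix matching the target's
# own greedy choices is accepted, plus one corrected token on disagreement.

def speculative_step(draft_next, target_next, prefix, k):
    """Returns the tokens appended in one speculative step."""
    proposal, ctx = [], list(prefix)
    for _ in range(k):                    # draft model runs k cheap steps
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(prefix)
    for tok in proposal:                  # target verifies all k at once
        t = target_next(ctx)              # target's greedy token here
        if t == tok:
            accepted.append(tok)          # draft agreed: accepted "for free"
            ctx.append(tok)
        else:
            accepted.append(t)            # first disagreement: take target's
            return accepted
    return accepted

if __name__ == "__main__":
    draft = lambda ctx: len(ctx) % 3                           # toy draft model
    target = lambda ctx: len(ctx) % 3 if len(ctx) < 4 else 9   # diverges later
    print(speculative_step(draft, target, [0, 1], k=4))        # [2, 0, 9]
```

In a real serving system the verification pass is a single batched forward through the target model, which is what makes accepted draft tokens nearly free.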

On-device LLM Inference (Serving) Systems

  • PowerInfer-2: Fast Large Language Model Inference on a Smartphone (arXiv'24) link to paper
  • Empowering 1000 tokens/second on-device LLM prefilling with mllm-NPU (arXiv'24) link to paper

Profiling and Benchmark Systems

  • MELTing point: Mobile Evaluation of Language Transformers (MobiCom'24) link to paper
  • MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases (NeurIPS'24) link to paper

LLM Training Systems

Single-GPU Systems

  • Cramming: Training a Language Model on a Single GPU in One Day (arXiv'22) link to paper
  • Easy and Efficient Transformer: Scalable Inference Solution For Large NLP Model (arXiv'22) link to paper
  • High-throughput Generative Inference of Large Language Models with a Single GPU (arXiv'23) link to paper
  • ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs (arXiv'23) link to paper

Distributed Systems

  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (SC'20) link to paper
  • Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (arXiv'20) link to paper
  • PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models (ICML'21) link to paper
  • Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (SC'21) link to paper
  • TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models (ICML'21) link to paper
  • FastMoE: A Fast Mixture-of-Expert Training System (arXiv'21) link to paper
  • Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model (arXiv'22) link to paper
  • Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (OSDI'22) link to paper
  • LightSeq2: Accelerated Training for Transformer-Based Models on GPUs (SC'22) link to paper
  • Pathways: Asynchronous Distributed Dataflow for ML (arXiv'22) link to paper
  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (NeurIPS'22) link to paper
  • Varuna: Scalable, Low-cost Training of Massive Deep Learning Models (EuroSys'22) link to paper
  • FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models (PPoPP'22) link to paper
  • PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing (arXiv'23) link to paper
  • Mobius: Fine Tuning Large-Scale Models on Commodity GPU Servers (ASPLOS'23) link to paper
  • Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression (ASPLOS'23) link to paper
  • ZeRO++: Extremely Efficient Collective Communication for Giant Model Training (arXiv'23) link to paper
  • A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training (ICS'23) link to paper
  • BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models (ICML'23) link to paper
  • Optimized Network Architectures for Large Language Model Training with Billions of Parameters (arXiv'23) link to paper
  • SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient (arXiv'23) link to paper
  • Blockwise Parallel Transformer for Large Context Models (NeurIPS'23) link to paper
  • Ring Attention with Blockwise Transformers for Near-Infinite Context (arXiv'23) link to paper
  • DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models (arXiv'23) link to paper
  • Effective Long-Context Scaling of Foundation Models (arXiv'23) link to paper
  • GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length (arXiv'23) link to paper
  • LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers (arXiv'23) link to paper
  • Efficient Streaming Language Models with Attention Sinks (arXiv'23) link to paper
  • PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management (TPDS'23) link to paper
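
ZeRO (and the sharded-state ideas several papers above build on) partitions optimizer state across data-parallel ranks so each rank stores and updates only 1/N of it. A toy single-process simulation of a ZeRO stage-1-style momentum-SGD step (illustrative only; no real gradients, collectives, or framework APIs):

```python
# Toy simulation of ZeRO stage-1-style optimizer-state sharding: every rank
# holds full parameters and gradients, but only a 1/world_size shard of the
# momentum buffer. Each rank updates its own shard, and the updated shards
# are conceptually all-gathered back into the full parameter vector.

def zero1_step(params, grads, momentum_shards, world_size, lr=0.1, beta=0.9):
    """momentum_shards[rank] holds momentum only for that rank's shard;
    this sharding is where ZeRO stage 1 saves memory."""
    n = len(params)
    shard = (n + world_size - 1) // world_size
    new_params = list(params)
    for rank in range(world_size):
        lo, hi = rank * shard, min((rank + 1) * shard, n)
        buf = momentum_shards[rank]
        for j, i in enumerate(range(lo, hi)):
            buf[j] = beta * buf[j] + grads[i]        # momentum for owned shard
            new_params[i] = params[i] - lr * buf[j]  # SGD-with-momentum step
    return new_params  # stands in for the all-gather of per-rank shards

if __name__ == "__main__":
    params, grads = [1.0] * 4, [0.5] * 4
    momentum = [[0.0, 0.0], [0.0, 0.0]]  # 2 ranks, each owning 2 momentum slots
    print(zero1_step(params, grads, momentum, world_size=2))
```

Stages 2 and 3 of ZeRO extend the same idea to gradients and parameters themselves, trading extra communication for further memory savings.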

General MLSys-Related Techniques (Incomplete)

  • Efficient GPU Spatial-Temporal Multitasking (TPDS'14) link to paper
  • Enabling preemptive multiprogramming on GPUs (ISCA'14) link to paper
  • Chimera: Collaborative Preemption for Multitasking on a Shared GPU (ASPLOS'15) link to paper
  • Simultaneous Multikernel GPU: Multi-tasking Throughput Processors via Fine-Grained Sharing (HPCA'16) link to paper
  • FLEP: Enabling Flexible and Efficient Preemption on GPUs (ASPLOS'17) link to paper
  • Dynamic Resource Management for Efficient Utilization of Multitasking GPUs (ASPLOS'17) link to paper
  • Mesh-TensorFlow: Deep Learning for Supercomputers (NeurIPS'18) link to paper
  • PipeDream: Fast and Efficient Pipeline Parallel DNN Training (SOSP'19) link to paper
  • GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism (NeurIPS'19) link to paper
  • PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications (OSDI'20) link to paper
  • Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences (OSDI'22) link to paper
  • Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models (ASPLOS'23) link to paper
  • AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (OSDI'23) link to paper
  • Benchmarking and Dissecting the Nvidia Hopper GPU Architecture (IPDPS'24) link to paper
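
GPipe and PipeDream above pipeline micro-batches across model stages: with S stages and M micro-batches, the forward pass completes in S + M - 1 time steps rather than S x M. A toy schedule generator showing which (stage, micro-batch) pairs run concurrently at each step (pure Python, illustrative only):

```python
# Toy GPipe-style forward schedule: stage s processes micro-batch m at time
# step s + m, so stages overlap on different micro-batches and the pipeline
# "bubble" shrinks as the number of micro-batches grows.

def gpipe_forward_schedule(num_stages, num_microbatches):
    """Return one list per time step of concurrently running
    (stage, microbatch) pairs."""
    total = num_stages + num_microbatches - 1
    steps = []
    for t in range(total):
        busy = [(s, t - s) for s in range(num_stages)
                if 0 <= t - s < num_microbatches]
        steps.append(busy)
    return steps

if __name__ == "__main__":
    for t, busy in enumerate(gpipe_forward_schedule(3, 4)):
        print(t, busy)   # steady state keeps all 3 stages busy
```

Schedules like PipeDream's 1F1B interleave backward passes into the same grid to bound activation memory; this sketch shows only the forward fill/drain pattern.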

LLM Algorithm Papers Recommended for System Researchers

  • Attention Is All You Need (NeurIPS'17) link to paper
  • Language Models are Unsupervised Multitask Learners (preprint from OpenAI) link to paper
  • Improving Language Understanding by Generative Pre-Training (preprint from OpenAI) link to paper
  • Language Models are Few-Shot Learners (NeurIPS'20) link to paper
  • GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (ICLR'21) link to paper
  • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (JMLR'20) link to paper
  • Multitask Prompted Training Enables Zero-Shot Task Generalization (ICLR'22) link to paper
  • Finetuned Language Models are Zero-Shot Learners (ICLR'22) link to paper
  • GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (ICML'22) link to paper
  • Training language models to follow instructions with human feedback (NeurIPS'22) link to paper
  • LaMDA: Language Models for Dialog Applications (arXiv'22) link to paper
  • PaLM: Scaling Language Modeling with Pathways (arXiv'22) link to paper
  • LoRA: Low-Rank Adaptation of Large Language Models (ICLR'22) link to paper
  • OPT: Open Pre-trained Transformer Language Models (arXiv'22) link to paper
  • Holistic Evaluation of Language Models (arXiv'22) link to paper
  • BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (arXiv'23) link to paper
  • LLaMA: Open and Efficient Foundation Language Models (arXiv'23) link to paper
  • Training Compute-Optimal Large Language Models (preprint from DeepMind) link to paper
  • Scaling Laws for Neural Language Models (preprint) link to paper
  • Scaling Language Models: Methods, Analysis & Insights from Training Gopher (preprint from DeepMind) link to paper
  • LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models (arXiv'23) link to paper
  • RWKV: Reinventing RNNs for the Transformer Era (arXiv'23) link to paper
  • LongNet: Scaling Transformers to 1,000,000,000 Tokens (arXiv'23) link to paper
  • SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference (arXiv'23) link to paper
  • FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (arXiv'23) link to paper
  • Retentive Network: A Successor to Transformer for Large Language Models (arXiv'23) link to paper
  • TransNormer: Scaling TransNormer to 175 Billion Parameters (arXiv'23) link to paper
  • Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding (arXiv'23) link to paper
  • From Sparse to Soft Mixtures of Experts (arXiv'23) link to paper
  • One Wide Feedforward is All You Need (arXiv'23) link to paper
  • Gated recurrent neural networks discover attention (arXiv'23) link to paper
  • Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning (arXiv'23) link to paper
  • Scaling Laws for Sparsely-Connected Foundation Models (arXiv'23) link to paper
  • Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large Language Models for Dynamic Inference Using Sorted Fine-Tuning (SoFT) (arXiv'23) link to paper
  • LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models (arXiv'23) link to paper
  • PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training (arXiv'23) link to paper
  • Retrieval meets Long Context Large Language Models (arXiv'23) link to paper
  • HyperAttention: Long-context Attention in Near-Linear Time (arXiv'23) link to paper
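
For system readers new to the architecture papers above, the operation almost every entry in this list optimizes is scaled dot-product attention, softmax(QK^T / sqrt(d)) V. A dependency-free toy implementation (single head, matrices as lists of lists; purely illustrative):

```python
# Toy scaled dot-product attention, the core of the Transformer:
# out = softmax(Q K^T / sqrt(d)) V, computed one query row at a time.

import math

def attention(Q, K, V):
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        m = max(scores)                          # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]          # attention weights sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

if __name__ == "__main__":
    Q = [[1.0, 0.0]]
    K = [[1.0, 0.0], [0.0, 1.0]]
    V = [[1.0, 0.0], [0.0, 1.0]]
    print(attention(Q, K, V))  # the query attends more to the first key
```

The quadratic cost in sequence length of the score matrix is exactly what FlashAttention, Ring Attention, and the long-context papers above attack.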

Surveys

  • A Survey of Large Language Models (arXiv'23) link to paper
  • Challenges and Applications of Large Language Models (arXiv'23) link to paper
  • FLM-101B: An Open LLM and How to Train It with $100K Budget (arXiv'23) link to paper
  • Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems (arXiv'23) link to paper

Awesome Open-Sourced LLMSys Projects

Other Useful Resources