Paper List for Machine Learning Systems

Paper list for broad topics in machine learning systems

NOTE: Survey papers are annotated with the [Survey 🔍] prefix.

Table of Contents

1. Data Processing

1.1 Data pipeline optimization

1.1.1 General

  • [arxiv'24] cedar: Composable and Optimized Machine Learning Input Data Pipelines
  • [MLSys'22] Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines
  • [ISCA'22] Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training
  • [SIGMOD'22] Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines
  • [VLDB'21] Analyzing and Mitigating Data Stalls in DNN Training
  • [VLDB'21] tf.data: A Machine Learning Data Processing Framework

1.1.2 Prep stalls

1.1.3 Fetch stalls (I/O)

1.1.4 Specific workloads (GNN, DLRM)

1.2 Caching and Distributed storage for ML training

1.3 Data formats

  • [ECCV'22] L3: Accelerator-Friendly Lossless Image Format for High-Resolution, High-Throughput DNN Training
  • [VLDB'21] Progressive Compressed Records: Taking a Byte Out of Deep Learning Data

1.4 Data pipeline fairness and correctness

  • [CIDR'21] Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines

1.5 Data labeling automation

  • [VLDB'18] Snorkel: Rapid Training Data Creation with Weak Supervision

2. Training System

2.1 Empirical Study on ML Jobs

  • [ICSE'24] An Empirical Study on Low GPU Utilization of Deep Learning Jobs
  • [NSDI'24] Characterization of Large Language Model Development in the Datacenter
  • [NSDI'22] MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters (PAI)
  • [ATC'19] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (Philly)

2.2 DL scheduling

2.3 GPU sharing

2.4 GPU memory management and optimization

2.5 GPU memory usage estimate

  • [ESEC/FSE'20] Estimating GPU Memory Consumption of Deep Learning Models

2.6 Distributed training (Parallelism)

2024

2023

2022

2021

  • [arxiv'21] Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training
  • [arxiv'21] GSPMD: General and Scalable Parallelization for ML Computation Graphs
  • [JMLR'21] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
  • [TPDS'21] TensorOpt: Exploring the Tradeoffs in Distributed DNN Training With Auto-Parallelism
  • [ATC'21] Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism
  • [SIGMOD'21] Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce [also in 2.10]
  • [MLSys'21] PipeMare: Asynchronous Pipeline Parallel DNN Training
  • [ICLR'21] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
  • [NeurIPS'21] Piper: Multidimensional Planner for DNN Parallelization
  • [ICML'21] Memory-Efficient Pipeline-Parallel DNN Training
  • [ICML'21] TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
  • [ICML'21] PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
  • [SC'21] Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
  • [SC'21] Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (PTD-P or Megatron-LM v2)
  • [FAST'21] Behemoth: A Flash-centric Training Accelerator for Extreme-scale DNNs
  • [PPoPP'21] DAPPLE: A Pipelined Data Parallel Approach for Training Large Models
  • [VLDB'21] Distributed Deep Learning on Data Systems: A Comparative Analysis of Approaches

2020

  • [HPCA'20] AccPar: Tensor Partitioning for Heterogeneous Deep Learning Accelerators
  • [NeurIPS'20] Efficient Algorithms for Device Placement of DNN Graph Operators
  • [arxiv'20] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
  • [KDD'20 Tutorial] DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters
  • [VLDB'20] PyTorch Distributed: Experiences on Accelerating Data Parallel Training
  • [OSDI'20] A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters (BytePS)
  • [SOSP'19] PipeDream: Generalized Pipeline Parallelism for DNN Training
  • [NeurIPS'20] Language Models are Few-Shot Learners [From OpenAI]
  • [arxiv'20] Scaling Laws for Neural Language Models [From OpenAI]

~2019

Survey Papers

  • [Survey 🔍] [IJCAI'22] Survey on Efficient Training of Large Neural Networks
  • [Survey 🔍] [ACM CSUR'19] Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
  • [Survey 🔍] [ACM CSUR'19] Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques, and Tools

2.7 DL job failures

2.8 Model checkpointing

  • [FAST'21] CheckFreq: Frequent, Fine-Grained DNN Checkpointing

2.9 AutoML

  • [OSDI'23] Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters
  • [NSDI'23] ModelKeeper: Accelerating DNN Training via Automated Training Warmup
  • [OSDI'20] Retiarii: A Deep Learning Exploratory-Training Framework

2.10 Communication optimization

2.11 Energy-efficient DNN training (carbon-aware)

2.12 DNN compiler

2.13 Model pruning and compression

2.14 GNN training system

For a comprehensive list of GNN systems papers, refer to https://github.com/chwan1016/awesome-gnn-systems.

2.15 Congestion control for DNN training

2.16 Others

3. Inference System

4. Mixture of Experts (MoE)

This is a list of papers on MoE training and inference, collected from Sections 2.6 and 3.

5. Federated Learning

6. Privacy-Preserving ML

7. ML APIs & Application-side Optimization

8. ML for Systems

Others

References

This repository is motivated by: