This repo is a curated list of papers on machine learning systems, inspired by awesome-tensor-compilers.
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning OSDI'22
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
- HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework VLDB'22
- Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs NSDI'23
- Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression ASPLOS'23
- AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness NeurIPS'22
- Varuna: Scalable, Low-cost Training of Massive Deep Learning Models EuroSys'22
- Megatron-LM SC'21
- Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines SC'21
- Piper: Multidimensional Planner for DNN Parallelization NeurIPS'21
- PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models ICML'21
- DAPPLE: An Efficient Pipelined Data Parallel Approach for Large Models Training PPoPP'21
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models ICML'21
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient
- ModelKeeper: Accelerating DNN Training via Automated Training Warmup NSDI'23
- STRONGHOLD: Fast and Affordable Billion-scale Deep Learning Model Training SC'22
- Whale: Efficient Giant Model Training over Heterogeneous GPUs ATC'22
- GeePS: Scalable Deep Learning on Distributed GPUs with a GPU-Specialized Parameter Server EuroSys'16
- Paella: Low-latency Model Serving with Virtualized GPU Scheduling SOSP'23
- Beta: Statistical Multiplexing with Model Parallelism for Deep Learning Serving OSDI'23
- Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access EuroSys'23
- MPCFormer: Fast, Performant, and Private Transformer Inference with MPC ICLR'23
- High-throughput Generative Inference of Large Language Models with a Single GPU
- Cocktail: A Multidimensional Optimization for Model Serving in Cloud NSDI'22
- Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing ATC'22
- Abacus SC'21
- Serving DNNs like Clockwork: Performance Predictability from the Bottom Up OSDI'20
- Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving ATC'19
- Nexus: A GPU Cluster Engine for Accelerating DNN-based Video Analysis SOSP'19
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts MLSys'23
- AutoMoE: Neural Architecture Search for Efficient Sparsely Activated Transformers
- Lucid: A Non-Intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs ASPLOS'23
- Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning NSDI'23
- Synergy: Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters OSDI'22
- Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning OSDI'21
- Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads OSDI'20
- Chronus: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs SoCC'21
- ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning ASPLOS'23
- Multi-Resource Interleaving for Deep Learning Training SIGCOMM'22
- Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training arXiv
- Out-of-order Backprop: An Effective Scheduling Technique for Deep Learning EuroSys'22
- KungFu: Making Training in Distributed Machine Learning Adaptive OSDI'20
- PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications OSDI'20
- Spada: Accelerating Sparse Matrix Multiplication with Adaptive Dataflow ASPLOS'23
- MISO: Exploiting Multi-Instance GPU Capability on Multi-Tenant GPU Clusters SoCC'22
- AccPar: Tensor Partitioning for Heterogeneous Deep Learning Accelerators HPCA'20
- Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs ASPLOS'23
- iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud TPDS'22
- Efficient Quantized Sparse Matrix Operations on Tensor Cores SC'22
- Pets ATC'22
- PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections OSDI'21
- APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores SC'21
- iGUARD SOSP'21
- Baechi: Fast Device Placement on Machine Learning Graphs SoCC'20
- Data Movement Is All You Need: A Case Study on Optimizing Transformers
- Legion: Automatically Pushing the Envelope of Multi-GPU System for Billion-Scale GNN Training ATC'23
- TC-GNN: Accelerating Sparse Graph Neural Network Computation Via Dense Tensor Core on GPUs ATC'23
- COGNN SC'22
- GNNAdvisor: An Efficient Runtime System for GNN Acceleration on GPUs OSDI'21
- Marius: Learning Massive Graph Embeddings on a Single Machine OSDI'21
- Accelerating Large Scale Real-Time GNN Inference Using Channel Pruning VLDB'21
- Reducing Communication in Graph Neural Network Training SC'20
- Fine-tuning Giant Neural Networks on Commodity Hardware with Automatic Pipeline Model Parallelism ATC'21
- Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training NSDI'23
- EnvPipe: Performance-preserving DNN Training Framework for Saving Energy ATC'23
- Characterizing Variability in Large-Scale, Accelerator-Rich Systems SC'22
- Prediction of the Resource Consumption of Distributed Deep Learning Systems SIGMETRICS'22
We encourage all contributions to this repository. Open an issue or send a pull request.