A curated list of LLM-systems-related academic papers, articles, tutorials, slides, and projects. Star this repository to keep abreast of the latest developments in this booming research field.
- Orca: A Distributed Serving System for Transformer-Based Generative Models | OSDI '22 (a toy sketch of its iteration-level batching appears after this list)
- FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance | Stanford
- Fast Distributed Inference Serving for Large Language Models | Peking University
- Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline | NUS
- Efficiently Scaling Transformer Inference | MLSys '23
- Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
- Reducing Activation Recomputation in Large Transformer Models
- DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale | SC '22
- FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU | UCB
- S3: Increasing GPU Utilization during Generative Inference for Higher Throughput
- Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
- AttMemo: Accelerating Self-Attention with Memoization on Big Memory Systems
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | SOSP '23 (see the block-table sketch after this list)
- Tabi: An Efficient Multi-Level Inference System for Large Language Models | EuroSys '23
- TurboTransformers: An Efficient GPU Serving System For Transformer Models
- Inference with Reference: Lossless Acceleration of Large Language Models
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
- SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference
- Full Stack Optimization of Transformer Inference: a Survey
- Optimized Network Architectures for Large Language Model Training with Billions of Parameters | UCB
- MPCFormer: Fast, Performant, and Private Transformer Inference with MPC | ICLR '23
- INFaaS: Automated Model-less Inference Serving | ATC '21
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | OSDI '22
- Pathways: Asynchronous Distributed Dataflow for ML | MLSys '22
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving | OSDI '23
- DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale | ICML '22
- ZeRO-Offload: Democratizing Billion-Scale Model Training | ATC '21
- ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning | SC '21
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | SC '20 (see the ZeRO memory-math sketch after this list)
- Band: Coordinated Multi-DNN Inference on Heterogeneous Mobile Processors | MobiSys '22
- Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing | ATC '22
- Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access | EuroSys '23
- Cocktail: A Multidimensional Optimization for Model Serving in Cloud | NSDI '22
- Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models
- SHEPHERD: Serving DNNs in the Wild | NSDI '23
- Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning | MLSys '23 (see the 2:4 pruning sketch after this list)
- AutoScratch: ML-Optimized Cache Management for Inference-Oriented GPUs
- ZeRO++: Extremely Efficient Collective Communication for Giant Model Training
- Channel Permutations for N:M Sparsity | NeurIPS '21
- Welder: Scheduling Deep Learning Memory Access via Tile-graph | OSDI '23
- Optimizing Dynamic Neural Networks with Brainstorm | OSDI '23
- ModelKeeper: Accelerating DNN Training via Automated Training Warmup | NSDI '23
- LLM Energy Leaderboard | UMich
- Aviary Explorer | Anyscale
- Open LLM Leaderboard | HuggingFace
- HELM | Stanford
- LMSYS | UCB
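
Several of the serving papers above, Orca in particular, revolve around iteration-level ("continuous") batching: the scheduler admits and retires requests between individual decoding steps rather than between whole sequences. Below is a minimal Python sketch of that scheduling loop, assuming hypothetical `Request` and `model_step` stand-ins rather than Orca's actual interfaces.

```python
from collections import deque

class Request:
    def __init__(self, prompt_tokens, max_new_tokens):
        self.tokens = list(prompt_tokens)  # prompt + generated tokens so far
        self.remaining = max_new_tokens    # decode budget left

def model_step(batch):
    """Run ONE decoding iteration for every request in the batch.

    A real engine would invoke the transformer here; this stub returns a
    dummy token id per request to keep the sketch self-contained.
    """
    return [len(req.tokens) % 50_000 for req in batch]

def serve(waiting: deque, max_batch: int = 8):
    running = []
    while waiting or running:
        # Admit new requests between iterations, not between sequences:
        # the key difference from request-level batching.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        for req, tok in zip(running, model_step(running)):
            req.tokens.append(tok)
            req.remaining -= 1
        # Finished requests leave immediately, freeing their batch slots
        # instead of stalling behind the longest sequence in the batch.
        running = [req for req in running if req.remaining > 0]
```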
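PagedAttention, the technique behind the vLLM entry above, manages the KV cache like virtual memory: each sequence maps its logical token positions through a block table onto fixed-size physical blocks. A minimal allocator sketch, with illustrative names (`BLOCK_SIZE`, `BlockAllocator`) rather than vLLM's real API:

```python
BLOCK_SIZE = 16  # tokens per KV-cache block

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}               # seq_id -> list of physical block ids

    def append_token(self, seq_id: int, pos: int):
        """Map logical position `pos` of a sequence to (physical block, offset).

        A new physical block is taken from the pool only when the sequence
        crosses a block boundary, so memory is committed on demand instead
        of being reserved for the maximum possible length.
        """
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:            # first slot of a fresh block
            table.append(self.free.pop())    # raises IndexError when out of memory
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free_sequence(self, seq_id: int):
        # Finished sequences return every block to the pool at once.
        self.free.extend(self.block_tables.pop(seq_id, []))
```

Because blocks are committed on demand and reclaimed on completion, internal fragmentation and worst-case over-reservation of the KV cache largely disappear.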
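The ZeRO line of work above partitions training state across data-parallel ranks. A back-of-envelope sketch of the memory arithmetic, assuming the mixed-precision Adam setup described in the ZeRO paper (fp16 parameters and gradients plus fp32 optimizer states, i.e. 2 + 2 + 12 = 16 bytes per parameter when fully replicated):

```python
def per_gpu_bytes(num_params: float, ndev: int, stage: int) -> float:
    p, g, o = 2.0, 2.0, 12.0  # bytes/param: fp16 params, fp16 grads, fp32 Adam state
    if stage >= 1:
        o /= ndev             # ZeRO-1: shard optimizer states
    if stage >= 2:
        g /= ndev             # ZeRO-2: additionally shard gradients
    if stage >= 3:
        p /= ndev             # ZeRO-3: additionally shard parameters
    return num_params * (p + g + o)

# 7.5B parameters on 64 GPUs: ~112 GiB per GPU when fully replicated
# shrinks to ~1.7 GiB with all three ZeRO stages enabled.
for stage in range(4):
    print(f"stage {stage}: {per_gpu_bytes(7.5e9, 64, stage) / 2**30:.1f} GiB/GPU")
```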
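The N:M sparsity entries above target fixed weight patterns such as 2:4, where every contiguous group of four weights keeps at most two nonzeros so that sparse tensor cores can skip the zeroed multiplications. A small NumPy sketch of the standard magnitude-pruning rule that produces this pattern (the function name is illustrative):

```python
import numpy as np

def prune_2_4(w: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude weights in every group of 4."""
    flat = w.reshape(-1, 4)                          # groups of M=4 weights
    keep = np.argsort(np.abs(flat), axis=1)[:, 2:]   # indices of the top N=2
    mask = np.zeros_like(flat, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return (flat * mask).reshape(w.shape)

w = np.random.randn(8, 8).astype(np.float32)         # total size divisible by 4
w_sparse = prune_2_4(w)
assert (np.count_nonzero(w_sparse.reshape(-1, 4), axis=1) == 2).all()
```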