MLsys_reading_list

A reading list of popular MLSys topics.

Fault tolerance

  1. Mitigating Stragglers in the Decentralized Training on Heterogeneous Clusters
  2. CPR: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery
  3. Efficient Replica Maintenance for Distributed Storage Systems
  4. GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints (SOSP '23, Rice)
  5. https://www.abstractsonline.com/pp8/#!/10856/presentation/9287
  6. Elastic Averaging for Efficient Pipelined DNN Training
  7. Understanding the Effects of Permanent Faults in GPU's Parallelism Management and Control Units (SC)
  8. Straggler-Resistant Distributed Matrix Computation via Coding Theory: Removing a Bottleneck in Large-Scale Data Processing (see the coded-computation sketch after this list)
  9. Coded Matrix Multiplication on a Group-Based Model (arXiv:1901.05162)
  10. Parity models: erasure-coded resilience for prediction serving systems
  11. A locality-based approach for coded computation (arXiv:2002.02440)
  12. Straggler Mitigation in Distributed Optimization Through Data Encoding
  13. Learning Effective Straggler Mitigation from Experience and Modeling
  14. Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs (NSDI '23, UCLA)
  15. Varuna: Scalable, Low-cost Training of Massive Deep Learning Models (EuroSys '22, MSR)
  16. Swift: Expedited Failure Recovery for Large-scale DNN Training (PPoPP '23, HKU)
  17. Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates (SOSP '23, UMich)
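
The coding-theory entries above (items 8-12) share one trick: encode the input with a little redundancy so the result can be decoded from any sufficiently large subset of workers, letting the scheduler ignore stragglers outright. Below is a minimal numpy sketch of that idea using a simple sum-parity code over two row blocks; it illustrates the principle, not any specific paper's scheme.

```python
import numpy as np

# (3, 2) coded matrix multiplication: split A into two row blocks, add one
# parity block, and give one block to each of 3 workers. A @ x is then
# recoverable from ANY 2 of the 3 partial results, so one straggler can be
# ignored instead of waited on.

A = np.random.randn(6, 4)
x = np.random.randn(4)

A1, A2 = A[:3], A[3:]          # systematic blocks
P = A1 + A2                    # parity block (simple sum code)

# Suppose the worker holding A2 straggles; we only hear back from the
# workers holding A1 and P.
y1 = A1 @ x
yp = P @ x

# Decode the missing result: P @ x = A1 @ x + A2 @ x  =>  A2 @ x = yp - y1
y2 = yp - y1
y = np.concatenate([y1, y2])

assert np.allclose(y, A @ x)
```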

LLM Program

System

  1. Optimizing LLM applications: https://arxiv.org/pdf/2407.00326 (CUHK)

New Recommendation Systems

  1. Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations (HSTU; Meta's new recommendation model mixing linear layers with attention)

MCTS-powered LLM

Algorithm

  1. MCTSr: https://arxiv.org/html/2406.07394v1 (code: https://github.com/trotsky1997/MathBlackBox); see the UCT skeleton after this list

  2. ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search (THU, NeurIPS '24; code: https://github.com/THUDM/ReST-MCTS)

  3. SC-MCTS: https://arxiv.org/abs/2410.01707 (THU; recent)
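
All three papers build on the same MCTS loop (select, expand, evaluate, backpropagate) and differ mainly in how candidate steps are proposed and scored. A minimal UCT skeleton with the LLM-specific pieces stubbed out; `propose_children` (sample candidate next steps from the model) and `reward` (self-evaluation in MCTSr, a process reward model in ReST-MCTS*) are placeholders here, not any paper's actual interface.

```python
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(node, c=1.4):
    # Upper-confidence bound: exploit high average value, explore
    # rarely-visited nodes. Only called on nodes with visits >= 1.
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def search(root, propose_children, reward, iters=100):
    for _ in range(iters):
        node = root
        # 1) Selection: descend via UCT while all children are visited.
        while node.children and all(ch.visits for ch in node.children):
            node = max(node.children, key=uct)
        # 2) Expansion: ask the LLM for candidate refinements/steps
        #    (propose_children must return at least one candidate).
        if not node.children:
            node.children = [Node(s, node) for s in propose_children(node.state)]
        unvisited = [ch for ch in node.children if not ch.visits]
        leaf = random.choice(unvisited or node.children)
        # 3) Evaluation: score the leaf (self-critique or reward model).
        r = reward(leaf.state)
        # 4) Backpropagation: update statistics up to the root.
        while leaf:
            leaf.visits += 1
            leaf.value += r
            leaf = leaf.parent
    # Return the most-visited first move, the standard MCTS decision rule.
    return max(root.children, key=lambda ch: ch.visits).state
```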

Diffusion Model System

Model

  1. DDPM: https://arxiv.org/abs/2006.11239 (cannot skip timesteps during sampling)
  2. DDIM: https://arxiv.org/abs/2010.02502 (can skip timesteps; see the sampling sketch after this list)
  3. Latent Diffusion Models: https://github.com/CompVis/latent-diffusion (diffusion in latent space rather than pixel space)
  4. Approximate Caching for Efficiently Serving Diffusion Models (NSDI '24): https://arxiv.org/abs/2312.04429
  5. LoRA for parameter-efficient finetuning of diffusion models (see the sketch after this list)
  6. ControlNet: https://arxiv.org/abs/2302.05543
  7. Video Diffusion Models: https://arxiv.org/abs/2204.03458 (3D U-Net); more: https://github.com/showlab/Awesome-Video-Diffusion
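
Items 1 and 2 differ in exactly one property worth seeing in code: DDIM's deterministic update (eta = 0) predicts x0 and jumps directly between any two points on the noise schedule, so a 1000-step training schedule can be sampled in, say, 50 strided steps, while DDPM's ancestral sampler must walk every step. A minimal numpy sketch, with `eps_model` as a stand-in for a trained noise predictor:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)      # cumulative products of alphas

def eps_model(x, t):                     # placeholder noise predictor
    return np.zeros_like(x)

def ddim_sample(shape, steps=50):
    ts = np.linspace(T - 1, 0, steps).astype(int)   # 50 of 1000 steps
    x = np.random.randn(*shape)                     # start from pure noise
    for t, t_prev in zip(ts[:-1], ts[1:]):
        eps = eps_model(x, t)
        # Predict x0 from the current noisy sample, then jump straight to
        # t_prev -- this jump is exactly what DDPM's sampler forbids.
        x0 = (x - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        x = np.sqrt(alpha_bar[t_prev]) * x0 + np.sqrt(1 - alpha_bar[t_prev]) * eps
    return x

sample = ddim_sample((4, 4))
```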
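
And the LoRA idea from item 5 in one picture: freeze the pretrained weight and learn a low-rank update, so finetuning a d x d layer touches only 2*d*r parameters. A generic sketch; the dimensions and names are illustrative, not tied to any particular diffusion codebase.

```python
import numpy as np

d, r = 768, 4                       # feature dim, LoRA rank (r << d)
W = np.random.randn(d, d)           # frozen pretrained weight
A = np.random.randn(r, d) * 0.01    # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection; zero-init so
                                    # the model starts exactly at pretrained

def lora_forward(x, scale=1.0):
    # y = x W^T + scale * x (BA)^T ; only A and B receive gradients.
    return x @ W.T + scale * (x @ A.T) @ B.T

x = np.random.randn(2, d)
y = lora_forward(x)                 # shape (2, d), same as the frozen layer
```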