MLsys_reading_list

A reading list of popular MLSys topics.

Fault tolerance

  1. Mitigating Stragglers in the Decentralized Training on Heterogeneous Clusters
  2. CPR: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery
  3. Efficient Replica Maintenance for Distributed Storage Systems
  4. GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints (SOSP '23, Rice)
  5. https://www.abstractsonline.com/pp8/#!/10856/presentation/9287
  6. Elastic Averaging for Efficient Pipelined DNN Training
  7. Understanding the Effects of Permanent Faults in GPU's Parallelism Management and Control Units (SC)
  8. Straggler-Resistant Distributed Matrix Computation via Coding Theory: Removing a Bottleneck in Large-Scale Data Processing (see the coded-computation sketch after this list)
  9. Coded Matrix Multiplication on a Group-Based Model (arXiv:1901.05162)
  10. Parity models: erasure-coded resilience for prediction serving systems
  11. A locality-based approach for coded computation (arXiv:2002.02440)
  12. Straggler Mitigation in Distributed Optimization Through Data Encoding
  13. Learning Effective Straggler Mitigation from Experience and Modeling
  14. Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs (NSDI '23, UCLA)
  15. Varuna: Scalable, Low-cost Training of Massive Deep Learning Models (EuroSys '22, MSR)
  16. Swift: Expedited Failure Recovery for Large-scale DNN Training (PPoPP '23, HKU)
  17. Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates (SOSP '23, UMich)
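
The coding-theory entries above (items 8-12) share one trick: encode the input with a little redundancy so the result can be decoded from any sufficiently large subset of workers, letting the scheduler ignore stragglers outright. Below is a minimal numpy sketch of that idea using a simple sum-parity code over two row blocks; it illustrates the principle, not any specific paper's scheme.

```python
import numpy as np

# (3, 2) coded matrix multiplication: split A into two row blocks, add one
# parity block, and give one block to each of 3 workers. A @ x is then
# recoverable from ANY 2 of the 3 partial results, so one straggler can be
# ignored instead of waited on.

A = np.random.randn(6, 4)
x = np.random.randn(4)

A1, A2 = A[:3], A[3:]          # systematic blocks
P = A1 + A2                    # parity block (simple sum code)

# Suppose the worker holding A2 straggles; we only hear back from the
# workers holding A1 and P.
y1 = A1 @ x
yp = P @ x

# Decode the missing result: P @ x = A1 @ x + A2 @ x  =>  A2 @ x = yp - y1
y2 = yp - y1
y = np.concatenate([y1, y2])

assert np.allclose(y, A @ x)
```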

LLM Program

System

  1. Optimizing LLM applications: https://arxiv.org/pdf/2407.00326 (CUHK)

New Recommendation Systems

  1. Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations (HSTU; Meta's new recommendation model mixing linear layers with attention)

MCTS-powered LLM

Algorithm

  1. MCTSr: https://arxiv.org/html/2406.07394v1 (code: https://github.com/trotsky1997/MathBlackBox); see the UCT skeleton after this list

  2. ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search (THU, NeurIPS '24; code: https://github.com/THUDM/ReST-MCTS)

  3. SC-MCTS: https://arxiv.org/abs/2410.01707 (THU; recent)
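
All three papers build on the same MCTS loop (select, expand, evaluate, backpropagate) and differ mainly in how candidate steps are proposed and scored. A minimal UCT skeleton with the LLM-specific pieces stubbed out; `propose_children` (sample candidate next steps from the model) and `reward` (self-evaluation in MCTSr, a process reward model in ReST-MCTS*) are placeholders here, not any paper's actual interface.

```python
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(node, c=1.4):
    # Upper-confidence bound: exploit high average value, explore
    # rarely-visited nodes. Only called on nodes with visits >= 1.
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def search(root, propose_children, reward, iters=100):
    for _ in range(iters):
        node = root
        # 1) Selection: descend via UCT while all children are visited.
        while node.children and all(ch.visits for ch in node.children):
            node = max(node.children, key=uct)
        # 2) Expansion: ask the LLM for candidate refinements/steps
        #    (propose_children must return at least one candidate).
        if not node.children:
            node.children = [Node(s, node) for s in propose_children(node.state)]
        unvisited = [ch for ch in node.children if not ch.visits]
        leaf = random.choice(unvisited or node.children)
        # 3) Evaluation: score the leaf (self-critique or reward model).
        r = reward(leaf.state)
        # 4) Backpropagation: update statistics up to the root.
        while leaf:
            leaf.visits += 1
            leaf.value += r
            leaf = leaf.parent
    # Return the most-visited first move, the standard MCTS decision rule.
    return max(root.children, key=lambda ch: ch.visits).state
```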

Diffusion Model System

Model

  1. DDPM: https://arxiv.org/abs/2006.11239 (cannot skip timesteps during sampling)
  2. DDIM: https://arxiv.org/abs/2010.02502 (can skip timesteps; see the sampling sketch after this list)
  3. Latent Diffusion Models: https://github.com/CompVis/latent-diffusion (diffusion in latent space rather than pixel space)
  4. Approximate Caching for Efficiently Serving Diffusion Models (NSDI '24): https://arxiv.org/abs/2312.04429
  5. LoRA for parameter-efficient finetuning of diffusion models (see the sketch after this list)
  6. ControlNet: https://arxiv.org/abs/2302.05543
  7. Video Diffusion Models: https://arxiv.org/abs/2204.03458 (3D U-Net); more: https://github.com/showlab/Awesome-Video-Diffusion
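
Items 1 and 2 differ in exactly one property worth seeing in code: DDIM's deterministic update (eta = 0) predicts x0 and jumps directly between any two points on the noise schedule, so a 1000-step training schedule can be sampled in, say, 50 strided steps, while DDPM's ancestral sampler must walk every step. A minimal numpy sketch, with `eps_model` as a stand-in for a trained noise predictor:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)      # cumulative products of alphas

def eps_model(x, t):                     # placeholder noise predictor
    return np.zeros_like(x)

def ddim_sample(shape, steps=50):
    ts = np.linspace(T - 1, 0, steps).astype(int)   # 50 of 1000 steps
    x = np.random.randn(*shape)                     # start from pure noise
    for t, t_prev in zip(ts[:-1], ts[1:]):
        eps = eps_model(x, t)
        # Predict x0 from the current noisy sample, then jump straight to
        # t_prev -- this jump is exactly what DDPM's sampler forbids.
        x0 = (x - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        x = np.sqrt(alpha_bar[t_prev]) * x0 + np.sqrt(1 - alpha_bar[t_prev]) * eps
    return x

sample = ddim_sample((4, 4))
```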
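
And the LoRA idea from item 5 in one picture: freeze the pretrained weight and learn a low-rank update, so finetuning a d x d layer touches only 2*d*r parameters. A generic sketch; the dimensions and names are illustrative, not tied to any particular diffusion codebase.

```python
import numpy as np

d, r = 768, 4                       # feature dim, LoRA rank (r << d)
W = np.random.randn(d, d)           # frozen pretrained weight
A = np.random.randn(r, d) * 0.01    # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection; zero-init so
                                    # the model starts exactly at pretrained

def lora_forward(x, scale=1.0):
    # y = x W^T + scale * x (BA)^T ; only A and B receive gradients.
    return x @ W.T + scale * (x @ A.T) @ B.T

x = np.random.randn(2, d)
y = lora_forward(x)                 # shape (2, d), same as the frozen layer
```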