- Mitigating Stragglers in the Decentralized Training on Heterogeneous Clusters
- CPR: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery
- Efficient Replica Maintenance for Distributed Storage Systems
- GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints SOSP 23 RICE
- https://www.abstractsonline.com/pp8/#!/10856/presentation/9287
- Elastic Averaging for Efficient Pipelined DNN Training
- Understanding the Effects of Permanent Faults in GPU's Parallelism Management and Control Units (SC)
- Straggler-Resistant Distributed Matrix Computation via Coding Theory: Removing a Bottleneck in Large-Scale Data Processing
- Coded Matrix Multiplication on a Group-Based Model https://arxiv.org/abs/1901.05162
- Parity models: erasure-coded resilience for prediction serving systems
- A locality-based approach for coded computation https://arxiv.org/abs/2002.02440
- Straggler Mitigation in Distributed Optimization Through Data Encoding
- Learning Effective Straggler Mitigation from Experience and Modeling
- Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs NSDI23 UCLA
- Varuna: Scalable, Low-cost Training of Massive Deep Learning Models Eurosys21 MSR
- Swift: Expedited Failure Recovery for Large-scale DNN Training PPoPP23 HKU
- Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates SOSP23 Umich
- Optimize LLM applications (CUHK) https://arxiv.org/pdf/2407.00326
- Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations (HSTU, Meta's new recsys architecture mixing linear projections and attention; see the sketch after this list)
- MCTSr: https://arxiv.org/html/2406.07394v1 Code: https://github.com/trotsky1997/MathBlackBox
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search THU NIPS24 Code: https://github.com/THUDM/ReST-MCTS
- SC-MCTS: https://arxiv.org/abs/2410.01707 (new, THU)
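A quick sketch for the HSTU entry above: the note "linear and attention mixed" roughly means each block combines an attention path with a pointwise gated linear path. The block below is only a generic illustration of that idea (the layer names, sizes, and SiLU gating are my assumptions), not Meta's actual HSTU layer.

```python
import torch
import torch.nn as nn

class GatedLinearAttentionBlock(nn.Module):
    """Generic sketch: attention output gated by a pointwise linear (SiLU) path.
    Illustrative only; not the actual HSTU block from the paper."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)   # pointwise linear path
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        gated = attn_out * nn.functional.silu(self.gate(h))  # mix linear and attention paths
        return x + self.out(gated)

# usage: sequences of user-interaction embeddings, shape (batch, seq_len, d_model)
block = GatedLinearAttentionBlock(d_model=64, n_heads=4)
y = block(torch.randn(2, 10, 64))
```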
1. DDPM https://arxiv.org/abs/2006.11239 (cannot skip timesteps during sampling)
2. DDIM https://arxiv.org/abs/2010.02502 (can skip timesteps; see the sampling sketch after this list)
3. Latent Diffusion Models https://github.com/CompVis/latent-diffusion (diffusion runs in latent space rather than pixel space)
4. [NSDI24] Approximate Caching for Efficiently Serving Diffusion Models https://arxiv.org/abs/2312.04429
5. LoRA for parameter-efficient fine-tuning of diffusion models (see the sketch after this list)
6. ControlNet https://arxiv.org/abs/2302.05543
7. Video Diffusion Models https://arxiv.org/abs/2204.03458 (3D U-Net). More: https://github.com/showlab/Awesome-Video-Diffusion
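For items 1 and 2: DDPM's sampler has to walk through every timestep, while DDIM's update (with eta = 0) is a deterministic map through the predicted clean sample, so it can jump across an arbitrary subsequence of timesteps. A minimal NumPy sketch of one DDIM step; the linear beta schedule and the zero `eps_model` are placeholder assumptions standing in for a trained noise predictor.

```python
import numpy as np

def ddim_step(x_t, t, t_prev, eps_model, alpha_bar, eta=0.0):
    """One DDIM update from timestep t to t_prev (t_prev can be far below t).
    eta=0.0 gives the deterministic DDIM sampler; eta=1.0 recovers DDPM-like noise."""
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    eps = eps_model(x_t, t)                                    # predicted noise
    x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)  # predicted clean sample
    sigma = eta * np.sqrt((1 - a_prev) / (1 - a_t)) * np.sqrt(1 - a_t / a_prev)
    dir_xt = np.sqrt(1.0 - a_prev - sigma**2) * eps            # direction pointing back to x_t
    noise = sigma * np.random.randn(*x_t.shape)
    return np.sqrt(a_prev) * x0_pred + dir_xt + noise

# Toy setup just to show the skipping: 1000 training timesteps, 50 sampling steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)
eps_model = lambda x, t: np.zeros_like(x)          # stand-in for the trained noise predictor
timesteps = np.linspace(T - 1, 0, 50, dtype=int)   # ~20 timesteps skipped per update
x = np.random.randn(8, 8)
for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
    x = ddim_step(x, t, t_prev, eps_model, alpha_bar)
```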
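For item 5: LoRA keeps the pretrained weight frozen and learns a low-rank update ΔW = B·A (rank r much smaller than the layer width), which in diffusion models is typically attached to the U-Net's attention projections. A minimal PyTorch sketch of a LoRA-wrapped linear layer; the class name and initialization choices are illustrative assumptions, not taken from any specific LoRA library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + (alpha / r) * B(A x)."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze the pretrained weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.lora_a.weight, std=0.01)
        nn.init.zeros_(self.lora_b.weight)          # start as a no-op: B A = 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# usage: wrap an attention projection of the U-Net and train only the LoRA weights
proj = LoRALinear(nn.Linear(320, 320), r=8)
trainable = [p for p in proj.parameters() if p.requires_grad]   # only lora_a / lora_b
y = proj(torch.randn(2, 320))
```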