
Accelerating Deep Learning Training (DLT) from Storage Perspective

Storage for AI

Fetch & Preprocessing

[2022 SIGMOD] Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines. [PDF] [Recording]

[2022 ATC] Cachew: Machine Learning Input Data Processing as a Service. [PDF]

[2021 ATC] Refurbish Your Training Data: Reusing Partially Augmented Samples for Faster Deep Neural Network Training. [PDF] [Slides]

Preprocessing Stall: cache partially augmented samples across all epochs within a job

[2021 VLDB] Analyzing and Mitigating Data Stalls in DNN Training. [PDF]

Hyperparameter (HP) Search: stage preprocessed minibatches so that all HP jobs reuse them within an epoch

[2020 FAST] Quiver: An Informed Storage Cache for Deep Learning. [PDF] [Slides]

Fetch Stall (Remote): share cached training data among multiple tasks
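
The notes above share one idea: pay a fetch or preprocessing cost once and reuse the result. As a rough illustration of the last note only (sharing cached training data among multiple tasks), here is a minimal sketch. It is not the design of any paper listed here; `SharedSampleCache`, `slow_remote_read`, and `run_task` are hypothetical names, and the "remote" read is simulated in memory.

```python
from typing import Callable, Dict, Iterable, List


class SharedSampleCache:
    """Hypothetical in-memory cache that several training tasks read through."""

    def __init__(self, fetch_remote: Callable[[int], bytes]) -> None:
        self._fetch_remote = fetch_remote
        self._store: Dict[int, bytes] = {}
        self.remote_fetches = 0  # counts how often remote storage was hit

    def get(self, sample_id: int) -> bytes:
        # Fetch from remote storage only on the first request for a sample;
        # every later request, from any task, is served from the cache.
        if sample_id not in self._store:
            self._store[sample_id] = self._fetch_remote(sample_id)
            self.remote_fetches += 1
        return self._store[sample_id]


def slow_remote_read(sample_id: int) -> bytes:
    # Assumption: stands in for an expensive read from remote storage.
    return f"sample-{sample_id}".encode()


def run_task(cache: SharedSampleCache, sample_ids: Iterable[int]) -> List[bytes]:
    # Each task may visit samples in its own order, as shuffled epochs do.
    return [cache.get(i) for i in sample_ids]


cache = SharedSampleCache(slow_remote_read)
task_a = run_task(cache, range(4))            # cold cache: 4 remote fetches
task_b = run_task(cache, reversed(range(4)))  # warm cache: 0 remote fetches
print(cache.remote_fetches)
```

Both tasks read the same four samples in different orders, yet remote storage is hit only once per sample. A real system would also need eviction, cross-process sharing, and access control, none of which this sketch attempts.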

Checkpointing

[2022 NSDI] Check-N-Run: a Checkpointing System for Training Deep Learning Recommendation Models. [PDF] [Slides]

[2021 FAST] CheckFreq: Frequent, Fine-Grained DNN Checkpointing. [PDF] [Slides]

[2020 CCGRID] DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models. [PDF]

Data Pipeline

[2021 VLDB] tf.data: A Machine Learning Data Processing Framework. [PDF]

Other

[2021 Ph.D. Dissertation] Accelerating Deep Learning Training: A Storage Perspective. [PDF]

Benchmark

[2020 MLSys] MLPerf Training Benchmark. [PDF]

[2021 Big Data Mining and Analytics] AIPerf: Automated Machine Learning as an AI-HPC Benchmark. [PDF]