This repository records the papers I collect and read. While reading, I take notes on each paper's research methods and key points, and I sort the papers into categories to keep my own work heading in the right research direction. The topics center on Kubernetes and scheduling.
- AS: Auto Scaling
- DL: Deep Learning
- DS: Distributed System
- NE: Network-Efficient
- RM: Resource Management
- RU: Resource Utilization
- RC: Resource Contention
- RS: Resource Scheduling
- DMLCS: Distributed Machine Learning Centralized Scheduling
- PA: Performance Analysis
- PS: Parameter Server
- PT: Parallelized Training
| Keywords | Paper Title | PDF | Slide | Year |
| --- | --- | --- | --- | --- |
| DL, Scheduling | Gandiva: Introspective Cluster Scheduling for Deep Learning | [pdf] | [slide] | 2018 |
| DL, CPU, RS | Scheduling CPU for GPU-based Deep Learning Jobs | [pdf] | [slide] | 2018 |
| DL, NE, Scheduling | DLTAP: A Network-efficient Scheduling Method for Distributed Deep Learning Workload in Containerized Cluster Environment | [pdf] | [slide] | 2018 |
| DL, Training System | Project Adam: Building an Efficient and Scalable Deep Learning Training System | [pdf] | [Video] | 2014 |
| DL, PS, Rack-Scale | Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training | [pdf] | [slide] | 2018 |
| ML, DS, PS | Scaling Distributed Machine Learning with the Parameter Server | [pdf] | [slide][Video] | 2014 |
| ML, Infra | Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective | [pdf] | [slide] | 2018 |
| RM | Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters | [pdf] | [slide] | 2018 |
| Scheduling, GPU, PA, RC | Topology-Aware GPU Scheduling for Learning Workloads in Cloud Environments | [pdf] | [slide] | 2017 |
| DL, RO, Job Scheduling, Autoscaling | DRAGON: A Dynamic Scheduling and Scaling Controller for Managing Distributed Deep Learning Jobs in Kubernetes Cluster | [pdf] | [slide] | 2017 |
| Keywords | Paper Title | PDF | Slide | Year |
| --- | --- | --- | --- | --- |
| DL, AS, Kubernetes | Deep Learning Based Auto-Scaling Load Balancing Mechanism for Distributed Software-Defined Storage Service | [pdf] | [slide] | 2018 |
| ML, Benchmarking, Kubernetes | Kubebench: A Benchmarking Platform for ML Workloads | [pdf] | [slide] | 2018 |
| RM, DMLCS, RU, Kubernetes, Kubeflow | GAI: A Centralized Tree-Based Scheduler for Machine Learning Workload in Large Shared Clusters | [pdf] | [slide] | 2018 |
| DL, Scheduling, Algorithm | Online Job Scheduling in Distributed Machine Learning Clusters | [pdf] | [slide] | 2018 |
| Autoscaling, Kubernetes | Containers Orchestration with Cost-Efficient Autoscaling in Cloud Computing Environments | [pdf] | [slide] | 2018 |
| DL, PT, Kubernetes | Parallelized Training of Deep NN – Comparison of Current Concepts and Frameworks | [pdf] | [slide] | 2018 |
| Keywords | Paper Title | PDF | Slide | Year |
| --- | --- | --- | --- | --- |
| DL, DS | Multi-tenant GPU Clusters for Deep Learning Workloads: Analysis and Implications | [pdf] | [slide] | 2018 |
| DL, DS | GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server | [pdf] | [slide] | 2016 |
| DL | Poseidon: A system architecture for efficient GPU-based deep learning on multiple machines | [pdf] | [slide] | 2015 |
| Mesos, Marathon, Ceph | Toward High-Availability Container as a Service on Mesos Cluster with Distributed Shared Volumes | [pdf] | [slide] | 2015 |
| DL, System | Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools | [pdf] | [slide] | 2019 |
- Traditional scheduling architecture
- Machine learning distributed clusters
- Model training
- Framework
- Parameter Server / AllReduce (see the aggregation sketch below)
- Combination of both
- Scheduler affinity (see the pod-manifest sketch below)
- Scheduler policy
- Hardware GPU topology
- Kube-batch
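
The "Parameter Server / AllReduce" item refers to the two gradient-aggregation patterns studied by the distributed training papers above (Parameter Server, Parameter Hub, GeePS on the centralized side; ring-based collectives on the other). Below is a minimal NumPy sketch of both patterns, simulated in a single process; the function names and the arithmetic simulation are illustrative only and are not taken from any listed paper.

```python
import numpy as np

def parameter_server_step(worker_grads):
    """Centralized pattern: every worker pushes its gradient to a server,
    the server averages them, and every worker pulls the result back."""
    server_grad = np.mean(worker_grads, axis=0)        # reduce at the server
    return [server_grad.copy() for _ in worker_grads]  # broadcast (pull)

def ring_allreduce_step(worker_grads):
    """Decentralized pattern: gradients are split into n chunks; in the real
    algorithm each worker passes chunks to its ring neighbor (reduce-scatter,
    then all-gather). Here the communication is simulated arithmetically."""
    n = len(worker_grads)
    chunks = [np.array_split(g, n) for g in worker_grads]
    # reduce-scatter: after n-1 hops, worker c owns the average of chunk c
    owned = [sum(chunks[w][c] for w in range(n)) / n for c in range(n)]
    # all-gather: the averaged chunks circulate until everyone has them all
    full = np.concatenate(owned)
    return [full.copy() for _ in range(n)]

# Both patterns must leave every worker with the same averaged gradient.
grads = [np.full(8, float(w)) for w in range(4)]
assert all(np.allclose(a, b) for a, b in
           zip(parameter_server_step(grads), ring_allreduce_step(grads)))
```

The trade-off the papers examine follows from the communication pattern: the server concentrates traffic on a few nodes (hence rack-scale designs like Parameter Hub), while all-reduce spreads it evenly around the ring.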
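
For the "Scheduler affinity" and "Hardware GPU topology" items, Kubernetes expresses placement constraints through node affinity in the pod spec. Below is a minimal sketch of such a manifest written as a Python dict; the topology label `example.com/gpu-topology=nvlink` and the image name are hypothetical stand-ins, while `nvidia.com/gpu` is the standard NVIDIA device-plugin resource name.

```python
# Hypothetical manifest: schedule a 2-GPU trainer only onto nodes that an
# operator has labeled as having NVLink-connected GPUs.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "dl-trainer"},
    "spec": {
        "affinity": {
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [{
                        "matchExpressions": [{
                            "key": "example.com/gpu-topology",  # hypothetical label
                            "operator": "In",
                            "values": ["nvlink"],
                        }],
                    }],
                },
            },
        },
        "containers": [{
            "name": "trainer",
            "image": "example/dl-trainer:latest",  # hypothetical image
            "resources": {"limits": {"nvidia.com/gpu": 2}},
        }],
    },
}
```

Static labels like this only approximate the hardware; topology-aware schedulers such as the SC'17 paper above instead inspect the actual PCIe/NVLink interconnect when placing jobs.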