
Master's Thesis Literature Notes (Deep Learning, Scheduling, Distributed Systems, Kubernetes)

Papers Notebook

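This project records the papers I collect and read. The goal is to note each paper's research method and key points as I read, and to sort the papers into categories so that I keep digging in the right research direction. The thesis topic centers on Kubernetes and scheduling.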

Keyword abbreviations:

  • AS: Auto Scaling
  • DL: Deep Learning
  • DS: Distributed System
  • NE: Network Efficient
  • RM: Resource Management
  • RU: Resource Utilization
  • RC: Resource Contention
  • RS: Resource Scheduling
  • DMLCS: Distributed Machine Learning Centralized Scheduling
  • PA: Performance Analysis
  • PT: Parallelized Training
  • PS: Parameter Server
  • ML: Machine Learning


Scheduler

Keywords | Paper Title | PDF | Slide | Year
DL, Scheduling | Gandiva: Introspective Cluster Scheduling for Deep Learning | [pdf] | [slide] | 2018
DL, CPU, RS | Scheduling CPU for GPU-based Deep Learning Jobs | [pdf] | [slide] | 2018
DL, NE, Scheduling | DLTAP: A Network-efficient Scheduling Method for Distributed Deep Learning Workload in Containerized Cluster Environment | [pdf] | [slide] | 2018
DL, Training System | Project Adam: Building an Efficient and Scalable Deep Learning Training System | [pdf] | [Video] | 2014
DL, PS, Rack-Scale | Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training | [pdf] | [slide] | 2018
ML, PS | Scaling Distributed Machine Learning with the Parameter Server | [pdf] | [slide] | 2014
ML, Infra | Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective | [pdf] | [slide] | 2014
RM | Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Cluster | [pdf] | [slide] | 2018
DS, PS | Scaling Distributed Machine Learning with the Parameter Server | [pdf] | [slide] [Video] | 2014
Scheduling, GPU, PA, RC | Topology-Aware GPU Scheduling for Learning Workloads in Cloud Environments | [pdf] | [slide] | 2017
DL, RO, Job Scheduling, Autoscaling | DRAGON: A Dynamic Scheduling and Scaling Controller for Managing Distributed Deep Learning Jobs in Kubernetes Cluster | [pdf] | [slide] | 2017


Keywords | Paper Title | PDF | Slide | Year
DL, AS, Kubernetes | Deep Learning Based Auto-Scaling Load Balancing Mechanism for Distributed Software-Defined Storage Service | [pdf] | [slide] | 2018
ML, Benchmarking, Kubernetes | Kubebench: A Benchmarking Platform for ML Workloads | [pdf] | [slide] | 2018
RM, DMLCS, RU, Kubernetes, Kubeflow | GAI: A Centralized Tree-Based Scheduler for Machine Learning Workload in Large Shared Clusters | [pdf] | [slide] | 2018
DL, Scheduling, Algorithm | Online Job Scheduling in Distributed Machine Learning Clusters | [pdf] | [slide] | 2018
Autoscaling, Kubernetes | Containers Orchestration with Cost-Efficient Autoscaling in Cloud Computing Environments | [pdf] | [slide] | 2018
DL, PT, Kubernetes | Parallelized Training of Deep NN – Comparison of Current Concepts and Frameworks | [pdf] | [slide] | 2018


Keywords | Paper Title | PDF | Slide | Year
DL, DS | Multi-tenant GPU Clusters for Deep Learning Workloads: Analysis and Implications | [pdf] | [slide] | 2018
DL, DS | GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server | [pdf] | [slide] | 2015
DL | Poseidon: A system architecture for efficient GPU-based deep learning on multiple machines | [pdf] | [slide] | 2015
Mesos, Marathon, Ceph | Toward High-Availability Container as a Service on Mesos Cluster with Distributed Shared Volumes | [pdf] | [slide] | 2015
DL, System | Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools | [pdf] | [slide] | 2019

Classic [Scheduler]

Paper Direction

  • Traditional scheduling architecture
  • Distributed machine learning cluster
    • Model training
    • Framework
    • Parameter Server / AllReduce
  • Combination of both
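
As a reading aid for the "Parameter Server / AllReduce" item above, here is a minimal single-process sketch of the two gradient-aggregation styles. It is a toy simulation under stated assumptions, not the method of any paper listed here: the function names `parameter_server_average` and `ring_allreduce_average` are hypothetical, workers are plain Python lists, and no real networking or framework is involved.

```python
from typing import List

Vector = List[float]


def parameter_server_average(worker_grads: List[Vector]) -> Vector:
    """Centralized style: every worker pushes its gradient to one server,
    which sums them and returns the average for workers to pull."""
    n = len(worker_grads)
    total = [0.0] * len(worker_grads[0])
    for grad in worker_grads:              # "push" from each worker
        for i, g in enumerate(grad):
            total[i] += g
    return [t / n for t in total]          # averaged gradient to "pull"


def ring_allreduce_average(worker_grads: List[Vector]) -> List[Vector]:
    """Decentralized style: simulated ring all-reduce. Each worker only
    talks to its right-hand neighbour and ends up holding the average."""
    n = len(worker_grads)
    dim = len(worker_grads[0])
    bufs = [list(g) for g in worker_grads]  # one local buffer per worker
    # Chunk c owns indices c, c + n, c + 2n, ... (hypothetical layout choice).
    chunks = [range(c, dim, n) for c in range(n)]

    # Phase 1, reduce-scatter: after n - 1 steps, worker r holds the
    # complete sum of chunk (r + 1) % n.
    for step in range(n - 1):
        msgs = [((r - step) % n, r) for r in range(n)]
        payloads = [[bufs[r][i] for i in chunks[c]] for c, r in msgs]
        for (c, r), payload in zip(msgs, payloads):
            dst = (r + 1) % n
            for i, v in zip(chunks[c], payload):
                bufs[dst][i] += v

    # Phase 2, all-gather: completed chunks travel around the ring and
    # overwrite the partial sums on every other worker.
    for step in range(n - 1):
        msgs = [((r + 1 - step) % n, r) for r in range(n)]
        payloads = [[bufs[r][i] for i in chunks[c]] for c, r in msgs]
        for (c, r), payload in zip(msgs, payloads):
            dst = (r + 1) % n
            for i, v in zip(chunks[c], payload):
                bufs[dst][i] = v

    return [[v / n for v in buf] for buf in bufs]
```

Both functions compute the same averaged gradient. The difference the sketch is meant to highlight is the communication pattern: the parameter server concentrates traffic on one node, while the ring spreads it evenly across neighbours.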


Learning Scheduler

  • Scheduler affinity
  • Scheduler Policy
  • Hardware GPU topology
  • Kube-batch

Operator Learning