This repository records the papers I collect and read. While reading, I take notes on each paper's research methods and key points, and I sort the papers into categories to keep my own work heading in the right research direction. The topics center on Kubernetes and scheduling.
- AS: Auto Scaling
- DL: Deep Learning
- DS: Distributed System
- NE: Network-Efficient
- RM: Resource Management
- RU: Resource Utilization
- RC: Resource Contention
- RS: Resource Scheduling
- DMLCS: Distributed Machine Learning Centralized Scheduling
- PA: Performance Analysis
- PS: Parameter Server
- PT: Parallelized Training
| Keywords | Paper Title | PDF | Slide | Year |
| --- | --- | --- | --- | --- |
| DL, Scheduling | Gandiva: Introspective Cluster Scheduling for Deep Learning | [pdf] | [slide] | 2018 |
| DL, CPU, RS | Scheduling CPU for GPU-based Deep Learning Jobs | [pdf] | [slide] | 2018 |
| DL, NE, Scheduling | DLTAP: A Network-efficient Scheduling Method for Distributed Deep Learning Workload in Containerized Cluster Environment | [pdf] | [slide] | 2018 |
| DL, Training System | Project Adam: Building an Efficient and Scalable Deep Learning Training System | [pdf] | [Video] | 2014 |
| DL, PS, Rack-Scale | Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training | [pdf] | [slide] | 2018 |
| ML, DS, PS | Scaling Distributed Machine Learning with the Parameter Server | [pdf] | [slide][Video] | 2014 |
| ML, Infra | Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective | [pdf] | [slide] | 2018 |
| RM | Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters | [pdf] | [slide] | 2018 |
| Scheduling, GPU, PA, RC | Topology-Aware GPU Scheduling for Learning Workloads in Cloud Environments | [pdf] | [slide] | 2017 |
| DL, RO, Job Scheduling, Autoscaling | DRAGON: A Dynamic Scheduling and Scaling Controller for Managing Distributed Deep Learning Jobs in Kubernetes Cluster | [pdf] | [slide] | 2017 |
| Keywords | Paper Title | PDF | Slide | Year |
| --- | --- | --- | --- | --- |
| DL, AS, Kubernetes | Deep Learning Based Auto-Scaling Load Balancing Mechanism for Distributed Software-Defined Storage Service | [pdf] | [slide] | 2018 |
| ML, Benchmarking, Kubernetes | Kubebench: A Benchmarking Platform for ML Workloads | [pdf] | [slide] | 2018 |
| RM, DMLCS, RU, Kubernetes, Kubeflow | GAI: A Centralized Tree-Based Scheduler for Machine Learning Workload in Large Shared Clusters | [pdf] | [slide] | 2018 |
| DL, Scheduling, Algorithm | Online Job Scheduling in Distributed Machine Learning Clusters | [pdf] | [slide] | 2018 |
| Autoscaling, Kubernetes | Containers Orchestration with Cost-Efficient Autoscaling in Cloud Computing Environments | [pdf] | [slide] | 2018 |
| DL, PT, Kubernetes | Parallelized Training of Deep NN – Comparison of Current Concepts and Frameworks | [pdf] | [slide] | 2018 |
| Keywords | Paper Title | PDF | Slide | Year |
| --- | --- | --- | --- | --- |
| DL, DS | Multi-tenant GPU Clusters for Deep Learning Workloads: Analysis and Implications | [pdf] | [slide] | 2018 |
| DL, DS | GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server | [pdf] | [slide] | 2016 |
| DL | Poseidon: A system architecture for efficient GPU-based deep learning on multiple machines | [pdf] | [slide] | 2015 |
| Mesos, Marathon, Ceph | Toward High-Availability Container as a Service on Mesos Cluster with Distributed Shared Volumes | [pdf] | [slide] | 2015 |
| DL, System | Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools | [pdf] | [slide] | 2019 |
- Traditional scheduling architecture
- Machine learning distributed clusters
- Model training
- Framework
- Parameter Server / AllReduce (see the aggregation sketch below)
- Combination of both
- Scheduler affinity (see the pod-manifest sketch below)
- Scheduler policy
- Hardware GPU topology
- Kube-batch
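
The "Parameter Server / AllReduce" item refers to the two gradient-aggregation patterns studied by the distributed training papers above (Parameter Server, Parameter Hub, GeePS on the centralized side; ring-based collectives on the other). Below is a minimal NumPy sketch of both patterns, simulated in a single process; the function names and the arithmetic simulation are illustrative only and are not taken from any listed paper.

```python
import numpy as np

def parameter_server_step(worker_grads):
    """Centralized pattern: every worker pushes its gradient to a server,
    the server averages them, and every worker pulls the result back."""
    server_grad = np.mean(worker_grads, axis=0)        # reduce at the server
    return [server_grad.copy() for _ in worker_grads]  # broadcast (pull)

def ring_allreduce_step(worker_grads):
    """Decentralized pattern: gradients are split into n chunks; in the real
    algorithm each worker passes chunks to its ring neighbor (reduce-scatter,
    then all-gather). Here the communication is simulated arithmetically."""
    n = len(worker_grads)
    chunks = [np.array_split(g, n) for g in worker_grads]
    # reduce-scatter: after n-1 hops, worker c owns the average of chunk c
    owned = [sum(chunks[w][c] for w in range(n)) / n for c in range(n)]
    # all-gather: the averaged chunks circulate until everyone has them all
    full = np.concatenate(owned)
    return [full.copy() for _ in range(n)]

# Both patterns must leave every worker with the same averaged gradient.
grads = [np.full(8, float(w)) for w in range(4)]
assert all(np.allclose(a, b) for a, b in
           zip(parameter_server_step(grads), ring_allreduce_step(grads)))
```

The trade-off the papers examine follows from the communication pattern: the server concentrates traffic on a few nodes (hence rack-scale designs like Parameter Hub), while all-reduce spreads it evenly around the ring.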
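
For the "Scheduler affinity" and "Hardware GPU topology" items, Kubernetes expresses placement constraints through node affinity in the pod spec. Below is a minimal sketch of such a manifest written as a Python dict; the topology label `example.com/gpu-topology=nvlink` and the image name are hypothetical stand-ins, while `nvidia.com/gpu` is the standard NVIDIA device-plugin resource name.

```python
# Hypothetical manifest: schedule a 2-GPU trainer only onto nodes that an
# operator has labeled as having NVLink-connected GPUs.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "dl-trainer"},
    "spec": {
        "affinity": {
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [{
                        "matchExpressions": [{
                            "key": "example.com/gpu-topology",  # hypothetical label
                            "operator": "In",
                            "values": ["nvlink"],
                        }],
                    }],
                },
            },
        },
        "containers": [{
            "name": "trainer",
            "image": "example/dl-trainer:latest",  # hypothetical image
            "resources": {"limits": {"nvidia.com/gpu": 2}},
        }],
    },
}
```

Static labels like this only approximate the hardware; topology-aware schedulers such as the SC'17 paper above instead inspect the actual PCIe/NVLink interconnect when placing jobs.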