Intelligent Resource Requirement Estimation and Scheduling for Deep Learning Jobs on Distributed GPU Clusters

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Liquid

This is the code repository for the deep learning job scheduling paper titled 'Liquid: Intelligent Resource Requirement Estimation and Scheduling for Deep Learning Jobs on Distributed GPU Clusters'.

The project is based on Docker.

Prerequisites

  • OS: CentOS Linux release 7.6.1810
  • NVIDIA driver 410.129
  • CUDA 10.0
  • Docker 19.03
  • nvidia-docker 2.2.2
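
Before bringing up the components, it can help to compare what is installed on a node against the versions listed above. The following is a minimal sketch, not part of the Liquid repository: `version_ge` and `check_tool` are hypothetical helper names, and treating the listed versions as minimums is an assumption (the list only states the versions that were used). It relies on GNU `sort -V` for version ordering.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of Liquid.

# version_ge A B: succeeds when version A is the same as or newer than B.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# check_tool NAME INSTALLED LISTED: print a one-line status for one tool.
check_tool() {
  if version_ge "$2" "$3"; then
    echo "ok: $1 $2 (listed: $3)"
  else
    echo "older: $1 $2 (listed: $3)"
  fi
}

# Example calls (each tool must be installed for its query to work):
#   check_tool docker "$(docker version --format '{{.Server.Version}}')" 19.03
#   check_tool driver "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)" 410.129
```

A result of "older" does not necessarily mean Liquid will fail, only that the node differs from the listed environment.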

Steps to bring up the Liquid components

Initialize a Docker Swarm cluster

# on master node
docker swarm init

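# on each other node: join the cluster
# (replace the token and manager address with the values printed by `docker swarm init`)
docker swarm join --token A-LONG-TOKEN-STRING-HERE 192.168.0.1:2377

# to remove a node from the cluster later (a manager node needs --force)
docker swarm leave
docker swarm leave --force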
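
If the output of `docker swarm init` is no longer on screen, the worker token can be read back on the master node. The sketch below assumes the default swarm management port 2377 used in the example above; `join_command` is a hypothetical helper, not part of Liquid, that only prints the command to run on a worker.

```shell
#!/usr/bin/env bash
# Sketch only: join_command is a hypothetical helper.

# join_command TOKEN MANAGER_ADDR: print the join command for a worker node.
join_command() {
  printf 'docker swarm join --token %s %s:2377\n' "$1" "$2"
}

# On the master node, read the worker token and print the command:
#   join_command "$(docker swarm join-token -q worker)" 192.168.0.1
# After the workers have joined, list the cluster members on the master:
#   docker node ls
```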

Create an overlay network named yao-net

docker network create --driver overlay --attachable yao-net

# encrypted variant:
# docker network create --driver overlay --attachable --opt encrypted yao-net

Note: if containers cannot communicate across nodes, try creating the network without the --opt encrypted option.
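
Running the create command a second time fails because the network already exists. A minimal sketch of a guard that creates the network only when it is missing is shown below; `ensure_overlay_network` and the overridable `DOCKER` variable are assumptions made for illustration, not part of Liquid.

```shell
#!/usr/bin/env bash
# Sketch only: the function name and the DOCKER override are hypothetical.

# DOCKER can be pointed at another command to preview or stub the calls.
DOCKER="${DOCKER:-docker}"

# ensure_overlay_network NAME: create the attachable overlay network
# NAME unless `docker network inspect` already finds it.
ensure_overlay_network() {
  if "$DOCKER" network inspect "$1" >/dev/null 2>&1; then
    echo "network $1 already exists"
  else
    "$DOCKER" network create --driver overlay --attachable "$1"
  fi
}

# Example, on the master node:
#   ensure_overlay_network yao-net
```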

Start HDFS cluster (Optional)

Liquid-docs/sbin/run_hdfs.sh

Start GlusterFS cluster (Optional)

Liquid-docs/sbin/run_glusterfs.sh

Start the agents on each Liquid-Worker

Liquid-docs/sbin/run_agent_helper.sh

Liquid-docs/sbin/run_agent.sh

Start the agent-master on Liquid-Master

Liquid-docs/sbin/start_agent_master.sh

Start MySQL

Liquid-docs/sbin/start_mysql.sh

Start Liquid-optimizer on Master Node

Liquid-docs/sbin/run_optimizer.sh

Start Liquid-scheduler

Liquid-docs/sbin/start_scheduler.sh

Start Redis

Liquid-docs/sbin/start_redis.sh

Start the web portal

Liquid-docs/sbin/start_portal.sh

Start Gitea

Liquid-docs/sbin/start_gitea.sh

Visit http://YOUR_IP/install.php
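
The startup steps above can be sketched as a single script. This is an illustration under several assumptions, none of which are confirmed by the repository: that the listed scripts can be run with `bash` from the directory containing `Liquid-docs`, that all of them from the agent-master onward run on the master node, and that the order in which they are listed is the intended order. The optional HDFS and GlusterFS steps and the per-worker agent scripts are left out. `run_master_steps`, `MASTER_STEPS`, `SBIN_DIR`, and `DRY_RUN` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only: names and the single-node assumption are hypothetical.

# Scripts in the order they are listed above.
MASTER_STEPS="start_agent_master.sh start_mysql.sh run_optimizer.sh start_scheduler.sh start_redis.sh start_portal.sh start_gitea.sh"

# Script directory, taken from the paths above; adjust to your checkout.
SBIN_DIR="${SBIN_DIR:-Liquid-docs/sbin}"

# run_master_steps: with DRY_RUN=1 (the default here) only print each
# command; with DRY_RUN=0 run each script and stop at the first failure.
run_master_steps() {
  for step in $MASTER_STEPS; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "bash $SBIN_DIR/$step"
    else
      bash "$SBIN_DIR/$step" || { echo "failed: $step" >&2; return 1; }
    fi
  done
}

# Preview the commands:   run_master_steps
# Actually run them:      DRY_RUN=0 run_master_steps
```

Defaulting to a dry run keeps the sketch from starting services by accident; stopping at the first failure avoids bringing up later components without the ones listed before them.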