Liquid

This is the code repository for the deep learning job scheduling paper titled 'Liquid: Intelligent Resource Requirement Estimation and Scheduling for Deep Learning Jobs on Distributed GPU Clusters'.

The project is based on Docker.

Prerequisites

OS CentOS Linux release 7.6.1810
Nvidia Driver 410.129
CUDA 10.0
Docker 19.03
Nvidia-docker 2.2.2

Steps to bring up the Liquid components

Initialize a Docker Swarm cluster

# on master node
docker swarm init

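# on each other node, join the cluster (use the token printed by `docker swarm init`)
docker swarm join --token A-LONG-TOKEN-STRING-HERE 192.168.0.1:2377

# to remove a node from the cluster later (a manager node needs --force)
docker swarm leave
docker swarm leave --force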
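
If the token printed by `docker swarm init` has been lost, it can be recovered on the master node. The following is a minimal sketch, not part of the Liquid scripts: the function names are hypothetical, and `192.168.0.1:2377` is only the example address used above. The Docker calls are wrapped in functions so that nothing runs until you call them.

```shell
#!/bin/sh
# Sketch only: helper functions around standard Docker Swarm commands.
# Function names are hypothetical and not part of the Liquid repository.

# Print only the worker join token (run on the master node).
print_worker_token() {
    docker swarm join-token -q worker
}

# List the nodes currently in the swarm (run on the master node).
list_nodes() {
    docker node ls
}

# Build the join command for a worker.
# $1 = join token, $2 = manager address as host:port
build_join_cmd() {
    printf 'docker swarm join --token %s %s\n' "$1" "$2"
}

# Example with the placeholder values from this README:
build_join_cmd "A-LONG-TOKEN-STRING-HERE" "192.168.0.1:2377"
```

After the workers have joined, calling `list_nodes` on the master should show one entry per node.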

Create an overlay network named `yao-net`

docker network create --driver overlay --attachable yao-net

# docker network create --driver overlay --attachable --opt encrypted yao-net

Note: if you create the network with `--opt encrypted` (the commented-out variant above) and containers cannot communicate across nodes, try creating it without that option.
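
Before starting any services, it may help to confirm that `yao-net` exists and uses the overlay driver. The check below is a sketch with a hypothetical function name; it relies only on `docker network inspect` and its `--format` option. Note that encrypted overlay traffic uses IPsec ESP (IP protocol 50), so a firewall that blocks ESP between nodes is one possible cause of the cross-node failure mentioned above.

```shell
#!/bin/sh
# Sketch only: check that a Docker network exists and is an overlay network.
# The function name is hypothetical and not part of the Liquid repository.

# $1 = network name; returns 0 if the network exists with driver "overlay".
network_is_overlay() {
    driver=$(docker network inspect --format '{{.Driver}}' "$1" 2>/dev/null) || return 1
    [ "$driver" = "overlay" ]
}
```

For example, `network_is_overlay yao-net && echo "yao-net is ready"` prints the message only when the network was created as shown above.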

Start HDFS cluster (Optional)

Liquid-docs/sbin/run_hdfs.sh

Start GlusterFS cluster (Optional)

Liquid-docs/sbin/run_glusterfs.sh

Start the agents on each Liquid-Worker

Liquid-docs/sbin/run_agent_helper.sh

Liquid-docs/sbin/run_agent.sh

Start the agent-master on Liquid-Master

Liquid-docs/sbin/start_agent_master.sh

Start MySQL

Liquid-docs/sbin/start_mysql.sh

Start Liquid-optimizer on Master Node

Liquid-docs/sbin/run_optimizer.sh

Start Liquid-scheduler

Liquid-docs/sbin/start_scheduler.sh

Start Redis

Liquid-docs/sbin/start_redis.sh

Start the web portal

Liquid-docs/sbin/start_portal.sh

Start Gitea

Liquid-docs/sbin/start_gitea.sh

Visit http://YOUR_IP/install.php
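
The non-optional startup steps above can be collected into one wrapper script. This is a sketch under stated assumptions, not a script shipped with Liquid: it assumes that all steps from the agent-master onward run on the master node (the README states this only for the agent-master and the optimizer), it leaves out the worker agents and the optional HDFS and GlusterFS steps, and the `SBIN` and `DRY_RUN` variables are hypothetical. `DRY_RUN` defaults to `1`, so the script only prints what it would run.

```shell
#!/bin/sh
# Sketch only: run the master-side Liquid startup scripts in README order.
# SBIN and DRY_RUN are hypothetical variables introduced for this example.

SBIN="${SBIN:-Liquid-docs/sbin}"
DRY_RUN="${DRY_RUN:-1}"

# Script names taken from the steps above, in the order they are listed.
MASTER_STEPS="start_agent_master.sh start_mysql.sh run_optimizer.sh start_scheduler.sh start_redis.sh start_portal.sh start_gitea.sh"

run_steps() {
    for step in $MASTER_STEPS; do
        if [ "$DRY_RUN" = "1" ]; then
            # Dry run: show the command without executing it.
            echo "would run: $SBIN/$step"
        else
            echo "running: $SBIN/$step"
            # Stop at the first script that fails.
            "$SBIN/$step" || { echo "failed: $step" >&2; return 1; }
        fi
    done
}

run_steps
```

Setting `DRY_RUN=0` in the environment executes the scripts for real and stops at the first failure.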

PasaLab/Liquid