The documentation below was written as a roadmap for future implementation and may not reflect how Latte is operated today. Reach out to the VP of tech (vp@csua.berkeley.edu) or root staff (via Discord) if you have any questions.
About Latte
Latte is a GPU server, donated in part by NVIDIA Corp. for use by the CS community. It features 8 datacenter-class NVIDIA Tesla P100 GPUs, which offer a large speedup for machine learning and related GPU computing tasks. The TensorFlow and PyTorch libraries are available for use as well.
User Guide
Getting Started
To begin using latte, you first need a CSUA account and membership in the ml2018 group. You can check whether you are a member by logging into soda.csua.berkeley.edu and running the id command.
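The membership check above can be scripted. This is a minimal sketch, not an official CSUA tool: has_group is a hypothetical helper name, and it relies only on id -nG, which prints the names of the groups the current user belongs to.

```shell
# Return success if the space-separated group list in $1 contains group $2.
# has_group is a hypothetical helper, not something provided on soda or latte.
has_group() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# id -nG prints the names of every group the current user is a member of.
if has_group "$(id -nG)" ml2018; then
    echo "You are in ml2018."
else
    echo "Not in ml2018 yet; email latte@csua.berkeley.edu to request access."
fi
```

Padding both strings with spaces makes the match whole-word, so a group named ml2018x would not count as ml2018.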
To get a CSUA account, please visit our office in 311 Soda and an officer will create an account for you.
To get into the ml2018 group, send an email to latte@csua.berkeley.edu with the following:
- Name
- CSUA Username
- Intended use
Once we receive your email, we will give you access to the group.
Once you have an account, you can log into latte.csua.berkeley.edu over SSH. This will bring you into the slurmctld machine. From here, you can begin setting up your jobs.
Testing Your Jobs
slurmctld is meant for testing only. There are limits on the amount of compute you can use while on this machine.
The /datasets/ directory holds some publicly available datasets in /datasets/share/. If you are using your own dataset, please place it in /datasets/, because the contents of /home/ are mounted over a network filesystem and will be slower to read.
Once you run your program and it works, you can submit a job.
Running Your Jobs
Slurm is used to manage job scheduling on latte.
To run a job, you need to submit it using the srun command. You can read about how to use Slurm here.
Slurm will send the job to one of the GPU nodes and run it there.
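As an illustration of submitting with srun, here is a minimal sketch. It makes several assumptions that the text above does not confirm: that latte exposes its GPUs as a Slurm generic resource named gpu (so that --gres=gpu:N applies), that python3 is available on the GPU nodes, and that your program is a script called train.py (a hypothetical name). run_on_gpu is likewise a hypothetical wrapper.

```shell
# Run a command on a GPU node through Slurm.
# Usage: run_on_gpu NUM_GPUS COMMAND [ARGS...]
# Set DRY_RUN=1 to print the srun command instead of executing it.
# Assumes GPUs are configured as a generic resource named "gpu".
run_on_gpu() {
    gpus="$1"
    shift
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "srun --gres=gpu:${gpus} $*"
    else
        srun --gres="gpu:${gpus}" "$@"
    fi
}

# Preview the command for a hypothetical script named train.py.
DRY_RUN=1 run_on_gpu 1 python3 train.py
```

Dropping DRY_RUN=1 runs the command for real, which only works on a machine where Slurm is installed, such as the slurmctld machine.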
Contact
If you have any questions, please email latte@csua.berkeley.edu.
Developer Guide
This repo contains the configurations used to test and deploy the Slurm Docker cluster known as latte. The important commands can be found in the Makefile.
The cluster is created using docker-compose, specifically nvidia-docker-compose, although a number of other pieces of software are also involved.
How docker-compose works
(Copied from https://docs.docker.com/compose/overview/ )
Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services. Then, with a single command, you create and start all the services from your configuration.
Using Compose is basically a three-step process:
- Define your app’s environment with a Dockerfile so it can be reproduced anywhere.
- Define the services that make up your app in docker-compose.yml so they can be run together in an isolated environment.
- Run docker-compose up and Compose starts and runs your entire app.
About the Makefile
The Makefile describes all the necessary commands for building and testing the cluster.
Slurm Docker Cluster (Documentation from original Repo)
This is a multi-container Slurm cluster using docker-compose. The compose file creates named volumes for persistent storage of MySQL data files as well as Slurm state and log directories.
Containers and Volumes
The compose file will run the following containers:
- mysql
- slurmdbd
- slurmctld
- c1 (slurmd)
- c2 (slurmd)
The compose file will create the following named volumes:
- etc_munge ( -> /etc/munge )
- etc_slurm ( -> /etc/slurm )
- slurm_jobdir ( -> /data )
- var_lib_mysql ( -> /var/lib/mysql )
- var_log_slurm ( -> /var/log/slurm )
Building the Docker Image
Build the image locally:
$ docker build -t slurm-docker-cluster:17.02.9 .
Starting the Cluster
Run docker-compose to instantiate the cluster:
$ docker-compose up -d
Register the Cluster with SlurmDBD
To register the cluster with the slurmdbd daemon, run the register_cluster.sh script:
$ ./register_cluster.sh
Note: You may have to wait a few seconds for the cluster daemons to become ready before registering the cluster. Otherwise, you may get an error such as sacctmgr: error: Problem talking to the database: Connection refused.
You can check the status of the cluster by viewing the logs:
$ docker-compose logs -f
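Rather than waiting by hand for the daemons to become ready, the registration step can be retried until it succeeds. The sketch below is not part of the original repo: retry is a hypothetical helper, the attempt count and delay are arbitrary, and it assumes register_cluster.sh exits with a non-zero status when the database connection is refused.

```shell
# Retry a command until it succeeds or the attempt limit is reached.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# retry is a hypothetical helper, not part of the slurm-docker-cluster repo.
retry() {
    max="$1"
    delay="$2"
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Try to register the cluster up to 10 times, 3 seconds apart.
# Assumes the script returns non-zero while slurmdbd is not ready yet.
if [ -x ./register_cluster.sh ]; then
    retry 10 3 ./register_cluster.sh
fi
```

If the script turns out to exit with status 0 even on failure, the loop stops after the first attempt and the logs remain the place to check.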
Accessing the Cluster
Use docker exec to run a bash shell on the controller container:
$ docker exec -it slurmctld bash
From the shell, execute Slurm commands, for example:
[root@slurmctld /]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up 5-00:00:00 2 idle c[1-2]
Submitting Jobs
The slurm_jobdir named volume is mounted on each Slurm container as /data.
Therefore, to see job output files from the controller, change to the /data directory on the slurmctld container before submitting a job:
[root@slurmctld /]# cd /data/
[root@slurmctld data]# sbatch --wrap="uptime"
Submitted batch job 2
[root@slurmctld data]# ls
slurm-2.out
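The session above shows that job 2 wrote its output to slurm-2.out. A small sketch of deriving that file name from sbatch's confirmation message follows; output_file_for is a hypothetical helper, and it assumes the message ends with the job ID, as in "Submitted batch job 2", and that the job uses sbatch's default slurm-<jobid>.out output name.

```shell
# Derive the default output file name from sbatch's confirmation message,
# e.g. "Submitted batch job 2" gives "slurm-2.out".
# output_file_for is a hypothetical helper; it assumes the job ID is the
# last word of the message.
output_file_for() {
    job_id="${1##* }"          # keep the text after the last space
    echo "slurm-${job_id}.out"
}

output_file_for "Submitted batch job 2"

# On the controller, submit from /data and report where the output will go.
if command -v sbatch >/dev/null 2>&1; then
    cd /data || exit 1
    msg="$(sbatch --wrap="uptime")"
    echo "Output will be written to /data/$(output_file_for "$msg")"
fi
```

The sbatch part only runs where Slurm is installed, so the snippet is harmless to try outside the container.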
Stopping and Restarting the Cluster
$ docker-compose stop
$ docker-compose start
Deleting the Cluster
To remove all containers and volumes, run:
$ docker-compose rm -sf
$ docker volume rm slurmdockercluster_etc_munge slurmdockercluster_etc_slurm slurmdockercluster_slurm_jobdir slurmdockercluster_var_lib_mysql slurmdockercluster_var_log_slurm
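The volume names in the command above are the named volumes from the compose file with a slurmdockercluster_ prefix, which appears to come from the Compose project name. If your checkout uses a different project name, the prefix will differ. The sketch below only prints the names for a given prefix; volume_names is a hypothetical helper, and the destructive command is left commented out.

```shell
# Print the full name of every named volume for a Compose project prefix.
# volume_names is a hypothetical helper; the volume list is taken from the
# "Containers and Volumes" section of this document.
volume_names() {
    for v in etc_munge etc_slurm slurm_jobdir var_lib_mysql var_log_slurm; do
        echo "${1}_${v}"
    done
}

volume_names slurmdockercluster

# To actually delete them (destructive), uncomment the next line:
# docker volume rm $(volume_names slurmdockercluster)
```

Running docker volume ls first is a cheap way to confirm the prefix before removing anything.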