PENG Zhenghao
December 5, 2020
git clone https://github.com/pengzhenghao/use-ray-with-slurm.git
cd use-ray-with-slurm
# Please make sure you have installed Ray first!
python launch.py --exp-name test --command "echo 1"
# or an RLlib task:
python launch.py --exp-name test --command "rllib train --run PPO --env CartPole-v0"
The above command launches a Ray cluster inside the SLURM cluster with 1 computing node.
Concretely, launch.py does the following things (a sketch of the generated script header follows this list):
- It automatically writes your requirements, e.g. the number of CPUs and GPUs per node, the number of nodes, and so on, into a sbatch script named {exp-name}_{date}-{time}.sh. In the above example, it is test_1205-1132.sh. Your command (--command) to launch your own job is also written into the sbatch script.
- Then it submits the sbatch script to the SLURM manager via a new process.
- Finally, the Python process terminates itself and leaves a log file named {exp-name}_{date}-{time}.log to record the progress of your submitted command.
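For orientation, the header of a generated sbatch script looks roughly like the following. This is a minimal sketch, not the verbatim output of launch.py: the exact set of #SBATCH directives may differ, and the values simply mirror the example above (test_1205-1132, the chpc partition, 1 node).

#!/bin/bash
#SBATCH --job-name=test_1205-1132     # {exp-name}_{date}-{time}
#SBATCH --output=test_1205-1132.log   # progress of your command is recorded here
#SBATCH --partition=chpc              # from --partition
#SBATCH --nodes=1                     # from --num-nodes
#SBATCH --tasks-per-node=1            # request one task slot per node (assumed directive)
#SBATCH --gres=gpu:1                  # only when --num-gpus > 0 (assumed directive)

module load cuda/10.1                 # whatever --load-env was set to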
If you want to utilize multiple computing nodes in SLURM and let Ray recognize them, please use:
python launch.py --exp-name test --command "python your_file.py" --num-nodes 3
If you want to specify the computing nodes, just use the same node names as returned by the sinfo command:
python launch.py --exp-name test --command "python your_file.py" --num-nodes 3 --node chpc-cn[003-005]
launch.py accepts the following arguments (a combined example follows the list):
- --exp-name: The experiment name. Will generate {exp-name}_{date}-{time}.sh and {exp-name}_{date}-{time}.log.
- --command: The command you wish to run. For example: rllib train XXX or python XXX.py.
- --num-gpus: The number of GPUs you wish to use in each computing node. Default: 0.
- --node (-w): The specific nodes you wish to use, in the same form as the output of sinfo. Automatically assigned if not specified.
- --num-nodes (-n): The number of nodes you wish to use. Default: 1.
- --partition (-p): The partition you wish to use. Default: "chpc" (CUHK cluster partition name, change it to yours!).
- --load-env: The command to set up your environment. For example: module load cuda/10.1. Default: "".
The sbatch script does the following things (a simplified sketch follows this list):
- It fetches the list of computing nodes and their IP addresses.
- It launches a head Ray process on one of the nodes and gets the address of the head node.
- It launches Ray processes on the (n-1) worker nodes and connects them to the head node by providing the head node's address.
- It submits the user-specified task to Ray.
Since all n nodes have launched their own Ray processes, and they are all connected to the head node's Ray process, the Ray cluster performs resource allocation just as on any other cluster.
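A minimal sketch of such a sbatch body is shown below. It is not the exact script generated by launch.py: the variable names, the chosen port, and the sleep durations are illustrative, while the scontrol, srun, and ray start invocations use standard SLURM and Ray 1.0.0 flags.

#!/bin/bash
# Fetch the list of computing nodes allocated to this job.
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)

# Pick the first node as the head node and resolve its IP address.
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
port=6379
ip_head="$head_node_ip:$port"

# Launch the Ray head process on the head node.
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --node-ip-address="$head_node_ip" --port=$port --block &
sleep 10

# Launch Ray worker processes on the remaining (n-1) nodes and point them at the head.
worker_num=$((SLURM_JOB_NUM_NODES - 1))
for ((i = 1; i <= worker_num; i++)); do
    node_i=${nodes_array[$i]}
    srun --nodes=1 --ntasks=1 -w "$node_i" \
        ray start --address="$ip_head" --block &
    sleep 5
done

# Finally, run the user-specified command; it connects to the Ray cluster just started.
python your_file.py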
- It works well with Ray 1.0.0; feel free to open an issue if you find it doesn't work.
- Feel free to copy the script to your own projects.
- This script is compatible with both IPv4 and IPv6 addresses of the computing nodes.
- This project is inspired by Yet Another Slurm Python Interface and the Ray sbatch submission scripts used at NERSC.