This project containerizes the SLURM HPC job scheduler on top of the Kubernetes container orchestrator. You can find an (almost) ready-to-use Kubernetes recipe here. This repository contains a YAML file to instantiate the containerized HPC cluster in Kubernetes, along with the Dockerfiles (and their related files) for building the images.
You have to create a dedicated namespace to host your containerized HPC cluster:
root@admin:~# kubectl create namespace hpc-nico
namespace/hpc-nico created
You must allow read access to the properties of the pods running in the namespace created in step 1. An rbac.yaml file is supplied; apply it:
root@admin:~# kubectl apply -f rbac.yaml
clusterrole.rbac.authorization.k8s.io/pods-list-hpc-nico created
clusterrolebinding.rbac.authorization.k8s.io/pods-list-hpc-nico created
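For reference, a minimal rbac.yaml granting that read access could look like the sketch below. The ClusterRole and ClusterRoleBinding names match the output above; binding to the default ServiceAccount of the hpc-nico namespace is an assumption, so adapt it to the supplied file.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pods-list-hpc-nico
rules:
# Read-only access to pod properties
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pods-list-hpc-nico
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pods-list-hpc-nico
subjects:
# Assumption: the pods run under the namespace's default ServiceAccount
- kind: ServiceAccount
  name: default
  namespace: hpc-nico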
A sample file slurm.yaml is supplied. It instantiates a Slurmctld service with two Slurmd nodes, each with 2 CPUs. You can apply the slurm.yaml file:
root@admin:~# kubectl apply -f slurm.yaml
service/nodes created
statefulset.apps/hpc-node created
statefulset.apps/control-node created
In this file you may want to customize three attributes:
spec:
  selector:
    matchLabels:
      app: slurmd
  serviceName: "nodes"
  replicas: 2
To set the number of compute nodes, adjust the last line, replicas: 2.
containers:
- name: slurmd
  image: nyk0/slurmcontainer
  volumeMounts:
  - mountPath: /run/munge
    name: sock
  - mountPath: /locate
    name: locate
  resources:
    limits:
      cpu: "2"
    requests:
      cpu: "2"
You must adapt the two parameters limits/cpu and requests/cpu so that each compute pod gets exactly the CPU amount you want (here, 2 vCPUs).
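If you want to double-check the values actually applied to a running compute pod, you can query its spec; this is just a quick sanity check:

root@admin:~# kubectl get pod hpc-node-0 -n hpc-nico -o jsonpath='{.spec.containers[*].resources}'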
To access the SLURM images, you may need to customize the imagePullSecrets section of each pod:
imagePullSecrets:
- name: regcred
You can find the steps to include your Docker Hub credentials in Kubernetes here.
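In short (the linked steps remain the reference), the regcred secret used above can be created in the same namespace with kubectl create secret docker-registry; the placeholders are yours to fill in:

root@admin:~# kubectl create secret docker-registry regcred \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=<your-username> \
  --docker-password=<your-password> \
  --docker-email=<your-email> \
  -n hpc-nico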
List pods in the dedicated namespace:
root@admin:~# kubectl get pods -n hpc-nico
NAME READY STATUS RESTARTS AGE
control-node-0 3/3 Running 0 81s
hpc-node-0 2/2 Running 0 81s
hpc-node-1 2/2 Running 0 74s
You have to enter the slurmctld container that belongs to the control-node-0 pod:
root@admin:~# kubectl exec -n hpc-nico -ti control-node-0 -c slurmctld -- /bin/bash
root@control-node-0:/#
Now you can su to a regular user account included in the custom slurmctld image:
root@control-node-0:/# su - nico
nico@control-node-0:~$
Finally, display the SLURM topology:
nico@control-node-0:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
COMPUTE* up infinite 2 idle hpc-node-[0-1]
Both nodes are idle and waiting for jobs. You can now run a very simple job with the srun command:
nico@control-node-0:~$ srun -N 2 hostname
hpc-node-0
hpc-node-1
The SLURM images include OpenMPI and a test code written in C; you can find it in (and run it from) the test user's home directory:
nico@control-node-0:~$ srun -n 4 ./pi 256
Elapsed time = 0.000006 seconds
Pi is approximately 3.1875000000000000, Error is 0.0459073464102069
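If you prefer batch submission over interactive srun, a minimal sketch of an equivalent job script could look like this (the script name pi.sh is hypothetical; it reuses the pi binary from the home directory):

#!/bin/bash
#SBATCH --job-name=pi
#SBATCH --ntasks=4
# Launch the MPI pi test on the allocated tasks
srun ./pi 256

Submit it with sbatch pi.sh and check the result in the slurm-<jobid>.out file.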
We currently have two containerized HPC nodes:
nico@control-node-0:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
COMPUTE* up infinite 2 idle hpc-node-[0-1]
They both respond:
nico@control-node-0:~$ srun -N 2 hostname
hpc-node-0
hpc-node-1
Patch the StatefulSet to move from 2 to 3 replicas:
root@admin:~# kubectl patch statefulsets hpc-node -n hpc-nico -p '{"spec":{"replicas":3}}'
statefulset.apps/hpc-node patched
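Equivalently, you can use kubectl scale instead of a JSON patch:

root@admin:~# kubectl scale statefulset hpc-node -n hpc-nico --replicas=3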
After a few seconds, our containerized cluster has scaled up:
nico@control-node-0:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
COMPUTE* up infinite 3 idle hpc-node-[0-2]
We can run a simple job:
nico@control-node-0:~$ srun -N 3 hostname
hpc-node-0
hpc-node-1
hpc-node-2
We can also run an MPI job to check that communications on the scaled containerized HPC cluster work:
nico@control-node-0:~$ srun -n 6 ./pi 256
Elapsed time = 0.000005 seconds
Pi is approximately 3.0000000000000000, Error is 0.1415926535897931
Our containerized HPC cluster has the following resources (2 nodes with 2 CPUs each):
nico@control-node-0:~$ srun -N 2 hostname
hpc-node-0
hpc-node-1
Let's submit a job too large for our current set of resources:
nico@control-node-0:~$ srun -N 3 hostname
srun: Requested partition configuration not available now
srun: job 10 queued and waiting for resources
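While the job waits, you can inspect it from the control node (output omitted here), for example:

nico@control-node-0:~$ squeue
nico@control-node-0:~$ scontrol show job 10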
Now, we scale up to 3 replicas:
root@admin:~# kubectl patch statefulsets hpc-node -n hpc-nico -p '{"spec":{"replicas":3}}'
statefulset.apps/hpc-node patched
As soon as the new replica joins our containerized HPC cluster, the pending job fails with these messages:
srun: job 10 has been allocated resources
srun: error: fwd_tree_thread: can't find address for host hpc-node-2, check slurm.conf
srun: error: Task launch for StepId=10.0 failed on node hpc-node-2: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
hpc-node-0
hpc-node-1
srun: error: Timed out waiting for job step to complete
If you re-run the command, it works. The reason is that a pending job that relies on srun and gets scheduled onto the newly arrived HPC node will fail, because the new node's address cannot yet be resolved (as the messages above show). You can find more references here. This will be fully supported in SLURM 23.02; you can find the roadmap here (slide "Truly Dynamic Nodes").