This repository was a trial of setting up a slurm cluster on a GitHub Actions runner using juju and the OSD slurm
bundle.
It now works, and you can inspect how to use it in .github/workflows/main.yaml.
However, I decided against using it further for several reasons:
- As far as I can see, every slurm command (srun, sbatch, sacct, squeue, etc.) has to be run on the unit slurmctld/0, and from within a juju command, so for example:
juju run --unit slurmctld/0 'srun --partition=osd-slurmd -vvvv echo "hello world"'
Thus, any use of this would need at least an alias for every slurm command, for the above something like the line below. Such an alias, however, collides with the quoting that is needed around the 'srun ...' command to keep its arguments from being parsed as juju arguments:
alias srun='juju run --unit slurmctld/0 srun'
- This setup uses quite a lot of disk space, and even after the dedicated cleanup step included in the example workflow, 3GB of disk space remain occupied (compared to before the cluster setup).
- While the OSD slurm bundle has detailed documentation (see below), some nitty-gritty details are missing, and I had to dig into the juju documentation or figure things out by trial and error. In addition, the juju documentation is not very accessible, and there is little community documentation or usage of juju: only very few Stack Overflow Q&As and almost no blogs or working example repositories on GitHub. So understanding how to do things in juju is not straightforward.
- Versioning of the OSD slurm bundle repository and of the respective juju charms on charmhub.io is murky at best, so pinning versions to ensure a working setup is not easily done.
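The alias and quoting problem from the first point might be worked around with shell functions instead of aliases. The following is only a sketch, not something used in this repository: the helper names shell_quote and slurm_on_ctld are made up for illustration, and it assumes that juju run hands the quoted string to a shell on the unit, as the srun example above suggests. Each argument is single-quoted again before it is joined into the one string passed to juju.

```shell
# Hypothetical helper: wrap one argument in single quotes so a remote shell
# reads it back as exactly one word. Embedded single quotes become '\''.
shell_quote() {
  printf "'%s' " "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Hypothetical helper: build one quoted command string from all arguments and
# run it on the slurmctld/0 unit, in the same form as the srun example above.
slurm_on_ctld() {
  _slurm_cmd=""
  for _slurm_arg in "$@"; do
    _slurm_cmd="${_slurm_cmd}$(shell_quote "$_slurm_arg")"
  done
  juju run --unit slurmctld/0 "$_slurm_cmd"
}

# One thin wrapper per slurm command that should be callable locally.
srun()   { slurm_on_ctld srun "$@"; }
sbatch() { slurm_on_ctld sbatch "$@"; }
sacct()  { slurm_on_ctld sacct "$@"; }
squeue() { slurm_on_ctld squeue "$@"; }
```

After sourcing this in a workflow step, a call such as srun --partition=osd-slurmd -vvvv echo "hello world" would be forwarded to the unit with its arguments kept intact. Functions are used rather than aliases because an alias cannot re-quote the arguments that follow it.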
This repository uses the Omnivector Solutions Slurm Distribution (OSD) slurm bundle to set up a slurm cluster in GitHub Actions via juju. Resources used are:
- Omnivector Solutions documentation for OSD
- OSD slurm bundle on charmhub.io
- OSD slurm bundle repository
- repository of individual charms for the OSD slurm bundle
- juju documentation
Generally, Ansible seems to be what most people use, and has much more uptake than juju.
So Ansible roles for slurm are probably the best place to start looking.
At the time of writing, these three stick out:
- StackHPC Ansible slurm setup: it is open, builds on top of OpenStack and comes with quite a bit of documentation.
- The sciCORE at the University of Basel has an Ansible role for slurm setup, and has nice-looking documentation for it.