ComputeCanada/magic_castle

Dynamic scalability of clusters


SLURM offers the option of elastic scaling (see https://slurm.schedmd.com/elastic_computing.html), which can be leveraged to power down nodes that are not in use. This could save a lot of money on paid cloud resources, since training clusters are likely to be idle for large parts of the day. Even when not paying for resources, it is potentially interesting to allow multiple clusters to effectively share the pool of compute nodes.
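For context, the machinery on the SLURM side is its power-saving configuration. Here is a minimal sketch of the `slurm.conf` knobs involved (the node names, counts and timings below are illustrative, not Magic Castle's actual values):

```ini
# slurm.conf -- power saving / elastic scaling (illustrative values)
SuspendProgram=/usr/local/sbin/node_suspend   # called with a hostlist of idle nodes to power down
ResumeProgram=/usr/local/sbin/node_resume     # called with a hostlist of nodes to power up
SuspendTime=600          # seconds a node must sit idle before it is suspended
ResumeTimeout=900        # how long SLURM waits for a resumed node to register
TreeWidth=128            # >= node count, so messages never fan out through powered-down nodes

# Nodes marked CLOUD start powered down and are only booted on demand.
NodeName=node[1-10] CPUs=4 RealMemory=7500 State=CLOUD
PartitionName=compute Nodes=node[1-10] Default=YES State=UP
```

All of the actual cloud interaction lives in the two suspend/resume scripts, which is exactly where the credentials problem below comes in.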

The problem with this is that it requires SLURM to have the power to tell the cloud to start a node that SLURM shut down for being idle. That means potentially opening the door to our cloud identity being compromised, since the authentication would have to be done from the server running SLURM (or you leverage a middle-man service like MC Hub to do the reboot). A number of providers seem to have options that might allow us to minimise the risk associated with having cloud access from the cluster.
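On OpenStack, for instance, one way to limit the blast radius would be Keystone application credentials, which are restricted by default and, on recent enough releases, can be narrowed further with access rules down to specific API calls. A hedged sketch; the credential name and rule paths below are made up and would need checking against the compute API version in use:

```bash
# Allow only listing servers and posting server actions (start/stop),
# nothing else -- no volume, image, or identity operations.
cat > slurm-access-rules.json <<'EOF'
[
    {"service": "compute", "method": "GET",  "path": "/v2.1/servers/*"},
    {"service": "compute", "method": "POST", "path": "/v2.1/servers/*/action"}
]
EOF

# Application credentials cannot create further credentials unless made
# "unrestricted"; --access-rules narrows this one down to the calls above.
openstack application credential create slurm-elastic \
    --access-rules slurm-access-rules.json
```

If such a credential leaked from the SLURM server, an attacker could only start and stop instances, not take over the whole project.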

An example suspend/resume elastic script was put together for Google Cloud at the beginning of Magic Castle. I am not sure how relevant it still is, but it could probably be a starting point once we figure out the credentials aspect.

Here is the link to the elastic script and the required config:
https://gist.github.com/cmd-ntrf/f356cf761d3b20a926e2899e0aa46a73
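For anyone skimming, the shape of such a script is roughly the following. To be clear, this is a hedged sketch rather than the gist's contents: it assumes an OpenStack cloud, that instance names match SLURM node names, and that CLI credentials (e.g. the application credential above) are available in the environment.

```python
#!/usr/bin/env python3
"""Minimal sketch of a SLURM ResumeProgram for an OpenStack cloud.

SLURM calls ResumeProgram with a hostlist expression (e.g. "node[1-3]")
as its first argument. A SuspendProgram is the mirror image, calling
"openstack server stop" instead.
"""
import subprocess
import sys


def expand_hostlist(hostlist):
    """Let SLURM expand "node[1-3]" into one hostname per line."""
    result = subprocess.run(
        ["scontrol", "show", "hostnames", hostlist],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def main():
    for node in expand_hostlist(sys.argv[1]):
        # Assumes the OpenStack instance carries the same name as the
        # SLURM node, and that auth comes from the environment (clouds.yaml
        # or OS_* variables for an application credential).
        subprocess.run(["openstack", "server", "start", node], check=True)


if __name__ == "__main__":
    main()
```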

I was just checking what happens with OpenStack when you stop/start an instance. In terms of your quota, nothing happens: a stopped instance still consumes quota. If you use shelve/unshelve, the instance is removed from the hypervisor (so it probably helps your hosting site), but it also still consumes quota.

(With OVH, I checked and it seems that only a shelved instance is not billed for compute resources.)
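That distinction matters for the suspend script: on a provider billed like OVH, stopping saves nothing, so a SuspendProgram would want to shelve instead. In the sketch above, that is a matter of swapping the CLI verbs (instance name illustrative):

```bash
openstack server stop node1      # stopped: off the CPU but still on the hypervisor;
                                 # still counted against quota, still billed on OVH
openstack server shelve node1    # shelved: snapshotted and removed from the hypervisor;
                                 # still counts against quota, but apparently not billed
openstack server unshelve node1  # bring it back when SLURM resumes the node
```

The trade-off is resume latency: unshelving has to reschedule and boot the instance from its snapshot, so ResumeTimeout would need to be generous.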

Thanks @ocaisa and @cmd-ntrf for exploring and documenting this! I'm very happy to have seen a concrete example of the kind of changes that could be made to make this possible.

Ping @andersy005 as we discussed this recently!

@cmd-ntrf mentioned that this could perhaps be implemented via Terraform Cloud (and its API). I think this could definitely work, but unfortunately restricting the permissions of the API (via Teams) is a paid feature of Terraform Cloud. You can use the API as the owner, but that potentially leaves a door open to a lot of different providers (for example, I currently have two workspaces for different OpenStack instances and another for AWS, all of which include application secrets that you can't read but can use).
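For completeness, the Terraform Cloud route would boil down to bumping the node count variable and queueing a run through its REST API. A hedged sketch based on my reading of the runs endpoint; the workspace ID is a placeholder, and the token is exactly the over-privileged secret discussed above:

```python
#!/usr/bin/env python3
"""Sketch: queue a Terraform Cloud run for a workspace (e.g. after raising
the compute node count). On the free tier the token acts with the full
permissions of its owner -- the crux of the problem above."""
import os

import requests

TFC_TOKEN = os.environ["TFC_TOKEN"]     # owner/team API token
WORKSPACE_ID = "ws-XXXXXXXXXXXXXXXX"    # placeholder workspace ID

response = requests.post(
    "https://app.terraform.io/api/v2/runs",
    headers={
        "Authorization": f"Bearer {TFC_TOKEN}",
        "Content-Type": "application/vnd.api+json",
    },
    json={
        "data": {
            "type": "runs",
            "attributes": {"message": "Resume idle compute nodes"},
            "relationships": {
                "workspace": {"data": {"type": "workspaces", "id": WORKSPACE_ID}}
            },
        }
    },
)
response.raise_for_status()
print(response.json()["data"]["id"])    # ID of the queued run
```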