GoogleCloudPlatform/cluster-toolkit

Unable to dynamically modify the number of nodes in a slurm cluster

nischith-sarvam opened this issue · 2 comments

Describe the bug

We tried expanding the slurm cluster by increasing the static node count variable in the YAML file from 14 to 16, but re-running the blueprint did not create the 2 new nodes. What is the way to dynamically increase and decrease the number of static nodes in a slurm cluster?
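
Roughly, the change was of this shape (a simplified sketch, not the full attached blueprint; the module id is illustrative and assumes the schedmd-slurm-gcp-v5-node-group module):

  - id: compute_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_static: 16  # was 14; re-running the blueprint did not add the 2 nodes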

Steps to reproduce

Steps to reproduce the behavior:

  1. Run the blueprint (see the command sketch after these steps)
  2. Increase the number of static nodes in the attached YAML file
  3. Run the blueprint again
  4. Observe that no new nodes are created
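
"Run the blueprint" here means regenerating and re-applying the deployment, roughly as follows (the deployment folder and group name "primary" are illustrative and depend on the blueprint):

  ./ghpc create configuration_cluster_setup.yml -w   # -w overwrites the existing deployment folder
  terraform -chdir=<deployment_name>/primary apply   # re-apply the regenerated configuration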

Expected behavior

Expected the number of nodes in the slurm cluster to increase

Actual behavior

No new nodes were created

Version (ghpc --version)

ghpc version v1.22.1
Built from 'main' branch.
Commit info: v1.22.1-0-g27f24fc5

Blueprint

configuration_cluster_setup.yml.txt

Execution environment

Cloud shell

Hi,

With the current slurm-gcp modules, resizing partitions through the YAML requires enable_reconfigure to be set to true. In fact, we recommend setting the following three variables in the top-level vars block:

  enable_reconfigure: true
  enable_cleanup_compute: true
  enable_cleanup_subscriptions: true
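
In a blueprint, that looks roughly like this (the project, naming, and location values are placeholders):

  vars:
    project_id: <your-project>
    deployment_name: slurm-cluster
    region: us-central1
    zone: us-central1-a
    enable_reconfigure: true
    enable_cleanup_compute: true
    enable_cleanup_subscriptions: true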

Enabling these options requires certain Python dependencies on the machine deploying the cluster (see the warning at the top of the controller module documentation).
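
Installing them looks something like this (the exact requirements file URL and version tag are listed in the controller module README; the URL below is a placeholder):

  # Python packages needed by enable_reconfigure and the cleanup options
  pip3 install -r https://raw.githubusercontent.com/GoogleCloudPlatform/slurm-gcp/<version>/scripts/requirements.txt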

Since your cluster doesn't seem to have that option enabled, there is no safe way of reconfiguring through the YAML without recreating the cluster. However, one can still change the partitions by editing the slurm.conf file on the controller and then restarting the slurmctld daemon. It is not as easy as editing the YAML, but I think it is your only alternative unless you redeploy this cluster with the above options enabled. See this guide for more information; it was written by SchedMD for an older version of Slurm but remains mostly relevant.
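
A rough sketch of that manual path (paths and node names are illustrative; slurm-gcp typically installs slurm.conf under /usr/local/etc/slurm, but verify on your controller):

  # On the controller VM:
  sudo vim /usr/local/etc/slurm/slurm.conf  # widen the NodeName=...-[0-13] range to ...-[0-15]
  sudo systemctl restart slurmctld          # restart so the new node definitions are read
  sinfo                                     # confirm the partition now lists the extra nodes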

Keep us posted!

Hi,

I will be closing this issue now. Please let me know if we can assist in any other way or if there are more questions.

Thanks!