GoogleCloudPlatform/cluster-toolkit

Using a newer version of Terraform can lead to controller replacement on reconfigure for Slurm GCP v6

Closed this issue · 3 comments

Describe the bug

For most changes, when reconfiguring a Slurm GCP v6 cluster, the controller should not be destroyed. This allows for operations such as adding a partition to be done on a running cluster.

Newer versions of Terraform cause these reconfigure operations to destroy and recreate the controller instead of updating it in place.

This can be very disruptive to a running cluster, because state such as the current queue and accounting information may be lost.

Terraform version 1.5 is known to have the intended behavior. Terraform version 1.7 is known to exhibit the bad behavior. The bug is caused by a change in how Terraform treats drift between state and Terraform code.

The maintainers of this repository are aware of this issue and are working on both a short-term and a long-term fix. If your workflow includes reconfiguring running Slurm GCP v6 clusters, please do not upgrade beyond Terraform 1.5 until this bug has been addressed.
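As a stopgap, a workflow can check the installed Terraform version before reconfiguring a cluster. The following is a minimal sketch, not the toolkit's own validator: the helper names and the `LAST_KNOWN_GOOD` threshold are assumptions for illustration, with the threshold taken from the 1.5 advice above. Terraform 1.6 is not covered by this report, so the sketch conservatively treats it as not known good.

```python
import json
import subprocess

# Assumption: newest release line reported as safe in this issue.
LAST_KNOWN_GOOD = (1, 5)


def parse_major_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a string such as '1.7.0' or 'v1.5.7'."""
    parts = version.strip().lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def is_known_good(version: str) -> bool:
    """True if the version is no newer than the 1.5.x release line."""
    return parse_major_minor(version) <= LAST_KNOWN_GOOD


def installed_terraform_version() -> str:
    """Query the local terraform binary (must be on PATH) for its version."""
    out = subprocess.run(
        ["terraform", "version", "-json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)["terraform_version"]
```

A wrapper script could call `is_known_good(installed_terraform_version())` and refuse to redeploy when it returns `False`.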

Steps to reproduce

Steps to reproduce the behavior:

  1. Install Terraform 1.7
  2. Deploy examples/hpc-slurm.yaml
  3. Add a partition to the blueprint
  4. Re-deploy the blueprint (ghpc deploy -w)

Expected behavior

The partition is added to the cluster without impacting the queue, the accounting database, or running jobs. The controller VM is modified in place and is not deleted.

Actual behavior

The controller is deleted and a new controller VM is created.

Blueprint

Any Slurm GCP v6 blueprint.

I'm running into this while following the (official!) slurm-on-gcp tutorial (HPC Toolkit):

./ghpc create examples/hpc-slurm.yaml -l ERROR --vars project_id=personal-235500
validator "test_tf_version_for_slurm" failed:
Error: using a newer version of Terraform can lead to controller replacement on reconfigure for Slurm GCP v6

Please be advised of this known issue: https://github.com/GoogleCloudPlatform/hpc-toolkit/issues/2774
Until resolved it is advised to use Terraform 1.4.0 with Slurm deployments.

To silence this warning, add flag: --skip-validators=test_tf_version_for_slurm

One or more blueprint validators has failed. See messages above for suggested
actions. General troubleshooting guidance and instructions for configuring
validators are shown below.

- https://goo.gle/hpc-toolkit-troubleshooting
- https://goo.gle/hpc-toolkit-validation

Validators can be silenced or treated as warnings or errors:

- https://goo.gle/hpc-toolkit-validation-levels

@mr0re1 has fixed this on develop. It will be included in the next release.

This issue is stale because it has been open for 30 days with no activity.