ionos-cloud/terraform-provider-ionoscloud

"ionoscloud_k8s_node_pool" "node_count" should not be required if "auto_scaling" is configured

mway-niels opened this issue · 4 comments

Current Provider Version

v6.3.1

Use-cases

We're running a Managed Kubernetes cluster on IONOS Cloud. We recently upgraded the Kubernetes version from 1.23.9 to 1.24.9. The update itself completed without any issues.

However, as part of the update procedure, IONOS scaled our node pools down to the value defined in node_count, which in our configuration is the same as auto_scaling.min_node_count. This caused downtime for the workloads running in the cluster, as they couldn't be scheduled due to missing resources (nodes). Removing node_count isn't possible as it is a required attribute.
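For reference, simply dropping node_count from the resource fails validation before a plan is even produced, with Terraform's usual missing-argument error (the file name, line number, and resource name below are illustrative):

Error: Missing required argument

  on node_pool.tf line 1, in resource "ionoscloud_k8s_node_pool" "example":
   1: resource "ionoscloud_k8s_node_pool" "example" {

The argument "node_count" is required, but no definition was found.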

Proposal

Update the Terraform provider implementation to make node_count optional when auto_scaling is configured. During an upgrade, the node pools should keep their current node count as the desired node_count and should not be forced to scale down to auto_scaling.min_node_count nodes.
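For illustration, under this proposal a node pool could be declared roughly like this (shortened for readability; omitting node_count is the hypothetical part, everything else follows the existing resource schema):

resource "ionoscloud_k8s_node_pool" "example" {
  k8s_cluster_id = ionoscloud_k8s_cluster.example.id
  k8s_version    = ionoscloud_k8s_cluster.example.k8s_version
  # ... other attributes as today ...

  # node_count omitted on purpose: during an upgrade the provider would keep
  # the pool's current node count instead of scaling it back to a fixed value.
  auto_scaling {
    min_node_count = var.min_node_count
    max_node_count = var.max_node_count
  }
}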

@mway-niels I understand the proposal, but I don't understand the reasons behind it. Can you please clarify some things for me?

We recently upgraded the Kubernetes version from 1.23.9 to 1.24.9. The update itself completed without any issues.

I assume that the upgrade was done by modifying the k8s_version in the tf plan; please confirm or correct me if I'm wrong.

as part of the update procedure IONOS downscaled our node pools to the value defined in node_count, which is the same as auto_scaling.min_node_count in our configuration.

From what you wrote here, I understand that the final number of nodes (after the upgrade) was node_count from the tf plan, but then, you wrote this:

During an update, the node groups should keep the current node_count as their desired node_count and should not be forced to scale to auto_scaling.min_node_count nodes.

From this it seems that, after the update, the number of nodes was auto_scaling.min_node_count. It's a bit confusing; I don't understand how many nodes you had in the end, after the update.

I'm not sure I understood what the problem is, but, from what you wrote, I think it's the following:

After the upgrade, the number of nodes wasn't node_count, as you expected, but auto_scaling.min_node_count, which would also explain what you wrote here: "This caused downtime for the workloads running in the cluster as they couldn't be scheduled due to missing resources (nodes)."

Please tell me if I understood correctly. If not, please try to clarify and add additional details, such as the expected number of nodes after the upgrade or anything else that may be useful; for example, something like this:

Configuration:

auto_scaling {
    min_node_count      = 1
    max_node_count      = 4
}
node_count = 2

Problem
After the upgrade, the node_count was 1, which was unexpected.

Expected behavior
After the upgrade, the number of nodes should be node_count = 2, not auto_scaling.min_node_count.

An explanation that includes the configuration, together with the expected number of nodes vs. the real values, will help me better understand what the problem is.

Apologies, let me clarify:

Our configuration (shortened for readability):

resource "ionoscloud_k8s_cluster" "example_k8s" {
  k8s_version = "1.24.9" # Changed from 1.23.9
  maintenance_window {
    ...
  }
}

resource "ionoscloud_k8s_node_pool" "example_k8s_nodepool" {
  k8s_cluster_id = ionoscloud_k8s_cluster.example_k8s.id
  maintenance_window {
    ...
  }
  datacenter_id = var.ionoscloud_datacenter_id
  lans {
    ...
  }
  k8s_version       = ionoscloud_k8s_cluster.example_k8s.k8s_version # Changed from 1.23.9
  node_count        = var.min_node_count # = 3
  auto_scaling {
    min_node_count = var.min_node_count # = 3
    max_node_count = var.max_node_count # = 7
  }
}

Timeline:

  1. The cluster is running with 2 node pools, configured as mentioned above. One node pool has 4 desired nodes, the other has 5.
  2. We change the k8s_version from 1.23.9 to 1.24.9.
  3. Our pipeline executes terraform plan and terraform apply:
# module.k8s_production[0].ionoscloud_k8s_cluster.example_k8s will be updated in-place
~ resource "ionoscloud_k8s_cluster" "example_k8s" {
      id                        = "XXXXX"
    ~ k8s_version               = "1.23.9" -> "1.24.9"
      # (1 unchanged attribute hidden)
      # (1 unchanged block hidden)
  }
# module.k8s_production[0].ionoscloud_k8s_node_pool.example_k8s_nodepool[0] will be updated in-place
~ resource "ionoscloud_k8s_node_pool" "example_k8s_nodepool" {
      id                = "XXXXX"
    ~ k8s_version       = "1.23.9" -> "1.24.9"
    ~ node_count        = 4 -> 3
      # (11 unchanged attributes hidden)
      # (4 unchanged blocks hidden)
  }
# module.k8s_production[0].ionoscloud_k8s_node_pool.example_k8s_nodepool[1] will be updated in-place
~ resource "ionoscloud_k8s_node_pool" "example_k8s_nodepool" {
      id                = "XXXXX"
    ~ k8s_version       = "1.23.9" -> "1.24.9"
    ~ node_count        = 5 -> 3
      # (11 unchanged attributes hidden)
      # (4 unchanged blocks hidden)
  }
  4. After starting terraform apply, the node pools are scaled down accordingly.
  5. IONOS replaces the nodes one by one (1.23.9 -> 1.24.9).
  6. When a node pool has finished updating, it is scaled up again by autoscaling, based on the Kubernetes scheduler's demand. There are downtimes during steps 4 and 5 and until step 6 is finished.

@mway-niels thank you! Now it is clear. The following things happen:

  1. You provision your infrastructure with node_count = 3, since this is the initial value from the tf plan.
  2. Autoscaling then modifies the number of nodes; let's say the number of nodes becomes 4.
  3. Now the number of nodes in the node pool is 4, but the number of nodes written in the tf plan and saved in the state file is still 3, so Terraform treats this as a change that should be made. If, at this moment, we run terraform plan, we will see something like:
resource "ionoscloud_k8s_node_pool" "example" {
        id                = "952becd7-57f5-4133-864b-22cb17fd06c9"
        name              = "k8sNodePoolExample"
      ~ node_count        = 4 # VALUE FROM THE API -> 3 # VALUE FROM TF PLAN

We have 2 solutions to avoid this:

  1. Use:
lifecycle {
  ignore_changes = [
    node_count
  ]
}

Inside the ionoscloud_k8s_node_pool resource (see the in-context sketch after this list). As the name says, this will ignore changes to node_count; more details are in the Terraform documentation for the lifecycle ignore_changes meta-argument. Keep in mind that, if you choose this solution, you will need to remove ignore_changes = [node_count] whenever you want to update the node_count value from the tf plan.

  2. In the tf plan, when you modify the version, you can also modify the node_count value to match the new one set by autoscaling (4, in our case).
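For reference, here is a shortened sketch of where the lifecycle block from solution 1 goes (only the lifecycle block is new; the other attributes stay as in your configuration above):

resource "ionoscloud_k8s_node_pool" "example_k8s_nodepool" {
  # ... existing attributes unchanged ...
  node_count = var.min_node_count
  auto_scaling {
    min_node_count = var.min_node_count
    max_node_count = var.max_node_count
  }

  # Ignore drift on node_count so that a later apply (e.g. a version upgrade)
  # does not scale the pool back down to the value stored in the configuration.
  lifecycle {
    ignore_changes = [
      node_count
    ]
  }
}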

I opted to use proposed solution 1 as manual node_count changes are unlikely since we're using an autoscaling configuration.