"ionoscloud_k8s_node_pool" "node_count" should not be required if "auto_scaling" is configured
mway-niels opened this issue · 4 comments
Current Provider Version
v6.3.1
Use-cases
We're running a Managed Kubernetes cluster on IONOS Cloud. We recently upgraded the Kubernetes version from 1.23.9 to 1.24.9. The update itself completed without any issues.
However, as part of the update procedure IONOS downscaled our node pools to the value defined in node_count, which is the same as auto_scaling.min_node_count in our configuration. This caused downtime for the workloads running in the cluster as they couldn't be scheduled due to missing resources (nodes). Removing node_count isn't possible as it is a required argument.
Proposal
Update the Terraform provider implementation to make node_count optional if auto_scaling is configured. During an update, the node pools should keep their current node_count as the desired node_count and should not be forced to scale down to auto_scaling.min_node_count nodes.
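If the provider made node_count optional when auto_scaling is set, the configuration could, hypothetically, be reduced to something like the following sketch (attribute values are illustrative; the provider does not currently accept this):

```hcl
resource "ionoscloud_k8s_node_pool" "example" {
  k8s_cluster_id = ionoscloud_k8s_cluster.example.id
  datacenter_id  = var.datacenter_id

  # node_count omitted: during an update the provider would keep the pool's
  # current size as the desired size instead of forcing a scale-down to
  # auto_scaling.min_node_count.
  auto_scaling {
    min_node_count = 3
    max_node_count = 7
  }
}
```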
@mway-niels I understand the proposal, but I don't understand the reasoning behind it. Can you please clarify a few things for me?
We recently upgraded the Kubernetes version from 1.23.9 to 1.24.9. The update itself completed without any issues.
I assume that the upgrade was done by modifying the k8s version in the tf plan, please confirm or correct me if I'm wrong.
as part of the update procedure IONOS downscaled our node pools to the value defined in node_count, which is the same as auto_scaling.min_node_count in our configuration.
From what you wrote here, I understand that the final number of nodes (after the upgrade) was node_count from the tf plan, but then, you wrote this:
During an update, the node groups should keep the current node_count as their desired node_count and should not be forced to scale to auto_scaling.min_node_count nodes.
From this it seems that, after the update, the number of nodes was auto_scaling.min_node_count. It's a bit confusing; I don't understand what number of nodes you had in the end, after the update.
I'm not sure I understood what the problem is, but, from what you wrote, I think it's the following: after the upgrade, the number of nodes wasn't node_count, as you expected, but auto_scaling.min_node_count, which would also explain what you wrote here: "This caused downtime for the workloads running in the cluster as they couldn't be scheduled due to missing resources (nodes)."
Please tell me if I understood correctly or not. If not, please try to clarify and maybe add additional details, such as the expected number of nodes after the upgrade or anything else that may be useful. For example:
Configuration:
auto_scaling {
  min_node_count = 1
  max_node_count = 4
}
node_count = 2
Problem
After the upgrade, the node_count was 1, which was unexpected.
Expected behavior
After the upgrade, the number of nodes should be node_count = 2, not auto_scaling.min_node_count.
An explanation that includes your configuration, along with the expected vs. actual number of nodes, will help me better understand what the problem is.
Apologies, let me clarify:
Our configuration (shortened for readability):
resource "ionoscloud_k8s_cluster" "example_k8s" {
  k8s_version = "1.24.9" # Changed from 1.23.9
  maintenance_window {
    ...
  }
}

resource "ionoscloud_k8s_node_pool" "example_k8s_nodepool" {
  k8s_cluster_id = ionoscloud_k8s_cluster.example_k8s.id
  maintenance_window {
    ...
  }
  datacenter_id = var.ionoscloud_datacenter_id
  lans {
    ...
  }
  k8s_version = ionoscloud_k8s_cluster.example_k8s.k8s_version # Changed from 1.23.9
  node_count  = var.min_node_count # = 3
  auto_scaling {
    min_node_count = var.min_node_count # = 3
    max_node_count = var.max_node_count # = 7
  }
}
Timeline:
- The cluster is running with 2 node pools, configured as mentioned above. One node pool has 4 desired nodes, the other has 5.
- We change the k8s_version from 1.23.9 to 1.24.9.
- Our pipeline executes terraform plan and terraform apply:
  # module.k8s_production[0].ionoscloud_k8s_cluster.example_k8s will be updated in-place
  ~ resource "ionoscloud_k8s_cluster" "example_k8s" {
        id          = "XXXXX"
      ~ k8s_version = "1.23.9" -> "1.24.9"
        # (1 unchanged attribute hidden)
        # (1 unchanged block hidden)
    }

  # module.k8s_production[0].ionoscloud_k8s_node_pool.example_k8s_nodepool[0] will be updated in-place
  ~ resource "ionoscloud_k8s_node_pool" "example_k8s_nodepool" {
        id          = "XXXXX"
      ~ k8s_version = "1.23.9" -> "1.24.9"
      ~ node_count  = 4 -> 3
        # (11 unchanged attributes hidden)
        # (4 unchanged blocks hidden)
    }

  # module.k8s_production[0].ionoscloud_k8s_node_pool.example_k8s_nodepool[1] will be updated in-place
  ~ resource "ionoscloud_k8s_node_pool" "example_k8s_nodepool" {
        id          = "XXXXX"
      ~ k8s_version = "1.23.9" -> "1.24.9"
      ~ node_count  = 5 -> 3
        # (11 unchanged attributes hidden)
        # (4 unchanged blocks hidden)
    }
- After starting terraform apply the node pools are scaled down accordingly.
- IONOS replaces the nodes one by one (1.23.9 -> 1.24.9).
- When a node pool has finished updating it is scaled up again according to the Kubernetes scheduler. There is downtime from the moment the pools are scaled down until the scale-up is finished.
@mway-niels thank you! Now it is clear, the following things happen:
- You provision your infrastructure using node_count = 3 since this is the initial value from the tf plan.
- The Kubernetes scheduler modifies the number of nodes, let's say that the number of nodes will be 4.
- Now, the number of nodes in the node pool will be 4, but the number of nodes written in the tf plan and saved in the state file is 3, so Terraform will think of this as a change that should be made. If, at this moment, we run terraform plan we will see something like:
  resource "ionoscloud_k8s_node_pool" "example" {
      id         = "952becd7-57f5-4133-864b-22cb17fd06c9"
      name       = "k8sNodePoolExample"
    ~ node_count = 4 # VALUE FROM THE API -> 3 # VALUE FROM TF PLAN
There are two solutions to avoid this:
- Use:
  lifecycle {
    ignore_changes = [
      node_count
    ]
  }
  inside the ionoscloud_k8s_node_pool resource. As the name says, this will ignore changes to node_count, more details here. Keep in mind that, if you choose this solution, you will need to remove ignore_changes = [node_count] if you want to update the node_count value from the tf plan.
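Applied to the node pool resource from earlier in the thread, solution 1 might look like the following sketch (shortened like the configuration above; variable names are illustrative):

```hcl
resource "ionoscloud_k8s_node_pool" "example_k8s_nodepool" {
  k8s_cluster_id = ionoscloud_k8s_cluster.example_k8s.id
  datacenter_id  = var.ionoscloud_datacenter_id
  k8s_version    = ionoscloud_k8s_cluster.example_k8s.k8s_version
  node_count     = var.min_node_count

  auto_scaling {
    min_node_count = var.min_node_count
    max_node_count = var.max_node_count
  }

  lifecycle {
    # Ignore drift in node_count caused by the autoscaler, so a later
    # apply (e.g. a version upgrade) does not scale the pool back down.
    ignore_changes = [node_count]
  }
}
```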
- In the tf plan, when you modify the version, you can also modify the node_count value to match the new one, the one set by the scheduler, 4, in our case.
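Solution 2 is simply a config edit made alongside the version bump; sketched against the example configuration above (values are illustrative):

```hcl
resource "ionoscloud_k8s_node_pool" "example" {
  # ... other arguments unchanged ...
  k8s_version = "1.24.9" # bumped from 1.23.9
  node_count  = 4        # updated to match the current size set by the autoscaler

  auto_scaling {
    min_node_count = 3
    max_node_count = 7
  }
}
```

The drawback is that node_count has to be kept in sync with the autoscaler by hand before every such change.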
I opted to use proposed solution 1 as manual node_count changes are unlikely since we're using an autoscaling configuration.