GoogleCloudPlatform/cluster-toolkit

GKE node pool upgrade settings are not configurable

Closed this issue · 3 comments

Describe the bug

In the toolkit, the GKE node pool upgrade settings are hardcoded:

upgrade_settings {
  strategy        = "SURGE"
  max_surge       = 0
  max_unavailable = 1
}

This prevents us from upgrading nodes in place efficiently. With these settings, nodes are upgraded one at a time and each node upgrade can take 9+ minutes, which makes maintaining large node pools unrealistic.

Steps to reproduce

Steps to reproduce the behavior:

  1. Trigger an in-place node pool upgrade

Expected behavior

You have an option to make multiple nodes unavailable at a time, to minimize the downtime.

Actual behavior

You don't have any option to upgrade more than one node at a time.

Quick question: you are okay with the SURGE strategy, but want to configure more than one node to be unavailable at a time?

Yes. In our particular case, being able to set a higher value for max_unavailable while keeping everything else as is would be enough.
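For illustration, a minimal Terraform sketch of what a configurable setting could look like. This is an assumption about the shape of the change, not the toolkit's actual implementation: the variable names `upgrade_settings` and `cluster_name` and the resource label `node_pool` are hypothetical. The defaults reproduce the currently hardcoded values, so existing deployments would behave the same unless the variable is overridden.

```hcl
# Hypothetical input variable; defaults match the values hardcoded today.
variable "upgrade_settings" {
  description = "Node pool upgrade settings (assumed name, for illustration only)."
  type = object({
    strategy        = string
    max_surge       = number
    max_unavailable = number
  })
  default = {
    strategy        = "SURGE"
    max_surge       = 0
    max_unavailable = 1
  }
}

# Hypothetical variable so the sketch is self-contained.
variable "cluster_name" {
  description = "Name of the GKE cluster that owns the node pool."
  type        = string
}

resource "google_container_node_pool" "node_pool" {
  cluster = var.cluster_name

  # Values now come from the variable instead of being hardcoded.
  upgrade_settings {
    strategy        = var.upgrade_settings.strategy
    max_surge       = var.upgrade_settings.max_surge
    max_unavailable = var.upgrade_settings.max_unavailable
  }
}
```

With a variable like this, the case described above would only need an override such as `upgrade_settings = { strategy = "SURGE", max_surge = 0, max_unavailable = 4 }`, where 4 is an arbitrary example value.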

Could you cherry-pick #3359 into the experimental branch so we can use it with GKE provisioning?