Feature: Terraform Templates Override

Question

Despire opened this issue 6 months ago · 0 comments

Problem:

Background on how Terraform (TF) works: when a change to the TF files is a breaking one, TF replaces the affected resources. In our case, if the nodes have to be replaced (e.g. due to an Ubuntu image update), Terraform destroys them and then re-creates them. However, the re-created nodes are not part of the K8s cluster. The old nodes were never gracefully removed from the cluster either, so the control plane sees them as NotReady and users can't recover from this manually.
Nevertheless, we need to be able to change the Terraform code for nodepools, for example to upgrade the OS version in the TF files or to add new features that come with breaking changes in the TF HCL code.

Proposal:

Ideally, we should find a way to deliver breaking changes in TF via rolling updates. That is, we create a new nodepool using the new TF code and then remove the old nodepool built using the old TF code. This way the cluster can sustain such updates without downtime. One way to achieve this is as follows.
We'll separate the TF code into a new repository. This would be a GitHub/berops repository by default, but users will be able to override it. Users will also be able to specify the repository tag and commit hash.
Unless the tag has been overridden by the user, the repository tag will match the running Claudie release version (e.g. 0.8.1).
If the tag is specified, Claudie will use the TF code from the commit that the tag points to.
If the tag has not been specified and Claudie defaults to the tag matching the current Claudie release version, upon an upgrade, Claudie will automatically do a rolling update by deploying the nodepool with the same configuration and TF code from the new release (e.g. 0.8.2).
However, if the user pinned the tag to 0.8.1, a Claudie upgrade to 0.8.2 would make no change to the nodepool infra, as the user's pin of the TF code has a higher priority.
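To make the precedence rule above concrete, here is a minimal sketch of how the effective repository and tag could be resolved. Everything named here (`TemplateSource`, `resolve_template_source`, `DEFAULT_REPO` and its URL) is hypothetical and not part of Claudie; the commit hash override is left out for brevity, since the proposal does not yet say how a commit-only pin should behave.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical placeholder; the proposal only says the default would be a
# GitHub/berops repository, without naming it.
DEFAULT_REPO = "https://example.invalid/upstream/templates.git"


@dataclass(frozen=True)
class TemplateSource:
    repo: str
    tag: str
    pinned: bool  # True when the user chose the tag; upgrades must not move it


def resolve_template_source(
    release: str,
    user_repo: Optional[str] = None,
    user_tag: Optional[str] = None,
) -> TemplateSource:
    """Pick the repo and tag of the TF code for a nodepool.

    A user-specified tag always wins. Without one, the tag follows the
    running release version, so an upgrade changes the resolved tag.
    """
    pinned = user_tag is not None
    return TemplateSource(
        repo=user_repo if user_repo is not None else DEFAULT_REPO,
        tag=user_tag if pinned else release,
        pinned=pinned,
    )
```

With this shape, `resolve_template_source("0.8.2")` follows the release, while `resolve_template_source("0.8.2", user_tag="0.8.1")` stays on 0.8.1 across the upgrade.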

Effects:

Unless the user pins a specific configuration repository tag, Claudie will perform rolling updates for TF code updates.
If the TF code didn't change between Claudie versions, it's perfectly OK for the same commit to carry multiple tags (e.g. 0.8.1, 0.8.2, 0.9.0); in that case, Claudie will detect that the commit is unchanged and won't trigger a rolling update.
This allows users to override TF configuration by forking our upstream configuration repository and introducing their changes.
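The "same commit, multiple tags" case could be handled by comparing the commits that the tags resolve to rather than the tag names. The sketch below assumes a tag-to-commit mapping is already available (it could, for instance, be built from the output of `git ls-remote --tags`); the function names, the step strings and the sample hashes are made up for illustration.

```python
from typing import Dict, List


def needs_rolling_update(
    tag_to_commit: Dict[str, str], current_tag: str, target_tag: str
) -> bool:
    """Roll only when the two tags point at different commits."""
    return tag_to_commit[current_tag] != tag_to_commit[target_tag]


def rolling_update_steps(
    nodepool: str,
    tag_to_commit: Dict[str, str],
    current_tag: str,
    target_tag: str,
) -> List[str]:
    """Return the ordered steps of the rolling update, or [] if none is needed.

    The order follows the proposal: build the new nodepool first, and only
    then remove the one built from the old TF code.
    """
    if not needs_rolling_update(tag_to_commit, current_tag, target_tag):
        return []
    return [
        f"create nodepool {nodepool} from TF code at {target_tag}",
        f"remove nodepool {nodepool} built from TF code at {current_tag}",
    ]
```

For example, with `{"0.8.1": "aaa111", "0.8.2": "aaa111", "0.9.0": "bbb222"}`, moving from 0.8.1 to 0.8.2 yields no steps, while moving from 0.8.2 to 0.9.0 yields the create-then-remove pair.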

Open questions:

Design the API for specifying the git repo with TF templates (CRD/ConfigMap/InputManifest).
Is the TF code repository also going to serve as the future repo for the kuber manifest overrides?
A suitable way to introduce this would probably be to rewrite the TF templates as TF modules, because we can nicely control the TF module URL and TF module version from main.tf/main.tpl. To be discussed.
Sometimes we have TF code that is shared per provider instance (e.g. Hetzner nodepools share the same SSH keys). If we say that each provider has both shared ("singleton") TF code and code that is tied to nodepools, then we inherently introduce a problem for upgrading the singleton code. One workaround would be to remove the provider A nodepools completely, replace them with nodepools from another provider B, upgrade Claudie, and re-introduce the provider A nodepools afterwards using the new TF code. Alternatively, do we say we don't want any "singleton" TF code and everything will be instantiated at the nodepool TF code level? In Hetzner, for example, two nodepools would then not be able to share SSH keys, so we'd need SSH keys per nodepool.
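On the TF modules question above: if the templates became modules fetched from git, the repository and tag (or commit) could both be carried in the module `source`, since Terraform's git sources accept a `?ref=` query and a `//subdir` path. Note that the `version` argument applies only to registry modules, so a git-hosted module would be pinned through `ref`. The helper below is a hypothetical sketch of what a main.tpl rendering step might emit; the module name, URL and subdirectory are placeholders.

```python
def render_module_block(name: str, repo_url: str, ref: str, subdir: str = "") -> str:
    """Render a Terraform module block pinned to a git tag or commit hash."""
    # "//<subdir>" selects a directory inside the repository.
    path = f"//{subdir}" if subdir else ""
    source = f"git::{repo_url}{path}?ref={ref}"
    return (
        f'module "{name}" {{\n'
        f'  source = "{source}"\n'
        "}\n"
    )


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    print(render_module_block(
        "nodepool", "https://example.invalid/templates.git", "0.8.1", subdir="hetzner"
    ))
```

Because `ref` accepts a tag as well as a commit hash, the same rendering would cover both kinds of user override mentioned in the proposal.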