flux-iac/tofu-controller

Terraform jobs failing with status "Reconciliation in progress" on private GKE cluster

Closed this issue · 6 comments

Today, my lab environment on a private GKE cluster began experiencing a problem. All 30+ Terraform jobs in the cluster are failing with the same error, even though we didn't make any changes.

Here are the details:

tf-controller: v0.15.0
Kubernetes: v1.24.15-gke.1700
Flux: v2.0.0-rc.5
Weave GitOps: v0.26.0

All my Terraform jobs are failing with the status "Reconciliation in progress." I have tried the following troubleshooting steps:

  • Deleted the tf-controller pod to force a restart: no change
  • Suspended all jobs and ran only one: no change
  • Enabled trace log level in the tf-controller pod: logs attached
    tf-controller-trace-loglevel.tar.gz
  • Ran break-the-glass in this pod: no change

I have also tried to inspect the output of the runner execution with logging enabled by setting the following variables:
DISABLE_TF_LOGS=0
ENABLE_SENSITIVE_TF_LOGS="1"
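
A minimal sketch of one place these can be set, assuming the installed Terraform CRD exposes spec.runnerPodTemplate.spec.env (everything outside the two variables is illustrative):

# Fragment of a Terraform object's spec (illustrative; the field path is an assumption)
spec:
  runnerPodTemplate:
    spec:
      env:
        - name: DISABLE_TF_LOGS
          value: "0"
        - name: ENABLE_SENSITIVE_TF_LOGS
          value: "1"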

However, the runner does not start, and I do not know how to access its logs to learn more about this issue.

Error from server (BadRequest): container "tf-runner" in pod "zone-vols.cat-tf-runner" is waiting to start: ContainerCreating
2023/10/26 16:56:34 Starting the runner... version  sha 
Error from server (NotFound): pods "zone-vols.cat-tf-runner" not found
Error from server (NotFound): pods "zone-vols.cat-tf-runner" not found
Error from server (BadRequest): container "tf-runner" in pod "zone-vols.cat-tf-runner" is waiting to start: ContainerCreating
Error from server (BadRequest): container "tf-runner" in pod "zone-vols.cat-tf-runner" is waiting to start: ContainerCreating
Error from server (BadRequest): container "tf-runner" in pod "zone-vols.cat-tf-runner" is waiting to start: ContainerCreating
2023/10/26 16:58:56 Starting the runner... version  sha 
rpc error: code = NotFound desc = an error occurred when try to find container "6f75c5325dbd5a38d0b074fe29beb1d3f811d67fc598eebeb715eda574a65754": not found
Error from server (NotFound): pods "zone-vols.cat-tf-runner" not found
Error from server (NotFound): pods "zone-vols.cat-tf-runner" not found
Error from server (NotFound): pods "zone-vols.cat-tf-runner" not found
Error from server (BadRequest): container "tf-runner" in pod "zone-vols.cat-tf-runner" is waiting to start: ContainerCreating
Error from server (BadRequest): container "tf-runner" in pod "zone-vols.cat-tf-runner" is waiting to start: ContainerCreating
Error from server (BadRequest): container "tf-runner" in pod "zone-vols.cat-tf-runner" is waiting to start: ContainerCreating
2023/10/26 17:05:24 Starting the runner... version  sha 
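
For completeness, a sketch of the standard kubectl checks for a pod stuck in ContainerCreating (the pod name is taken from the logs above; the namespace placeholder is whichever namespace the Terraform object lives in):

# Events on the stuck pod usually show why the container cannot be created
kubectl -n <namespace> describe pod zone-vols.cat-tf-runner

# Recent events in the namespace, oldest first
kubectl -n <namespace> get events --sort-by=.metadata.creationTimestamp

# Runner container logs; --previous shows the prior attempt if the container restarted
kubectl -n <namespace> logs zone-vols.cat-tf-runner -c tf-runner --previous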

Could you please help me troubleshoot this issue?

I have had this issue in the past as well. Some of the things I have had to do (sketched below in kubectl terms) are:

  1. Delete the tf-controller pod and the tf-runner pods
  2. Suspend/restart the tf resource
  3. A combination of the above
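
Roughly, in kubectl terms (names in angle brackets are placeholders; spec.suspend on the Terraform object is assumed to behave like the suspend field on other Flux resources):

# 1. Delete the controller pod and the stuck runner pod so they get recreated
kubectl -n flux-system delete pod <tf-controller-pod-name>
kubectl -n <namespace> delete pod <terraform-name>-tf-runner

# 2. Suspend and then resume the Terraform resource to retrigger reconciliation
kubectl -n <namespace> patch terraform <terraform-name> --type=merge -p '{"spec":{"suspend":true}}'
kubectl -n <namespace> patch terraform <terraform-name> --type=merge -p '{"spec":{"suspend":false}}'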

Unfortunately, like you, I find it frustrating when the controller thinks it is in progress but the runner pod seems to be waiting for something.

GKE Cloud DNS resolution support has been released in v0.16.0-rc.3. Please upgrade and refer to the docs here: https://weaveworks.github.io/tf-controller/getting_started/#installation-on-gke
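
A minimal sketch of that upgrade, assuming the chart was installed from the tf-controller Helm repository as in the getting-started docs (release name, namespace, and values file are assumptions; adjust to your setup):

helm repo add tf-controller https://weaveworks.github.io/tf-controller/
helm repo update
helm upgrade -n flux-system tf-controller tf-controller/tf-controller \
  --version <chart version that packages v0.16.0-rc.3> \
  -f values.yaml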

Oh, thank you @chanwit! Good point, let me try on Monday and I will apply your fix. Do you know if there are any breaking changes? I think this upgrade will fix my environment. I will also upgrade Flux2 to 2.1.0. On the other hand, do you have any plans to release 0.16.0 or 0.16.0-rc.4 soon?

Yes, there is a breaking change in v0.16.0-rc.3. Please refer to the change log.

thank you @chanwit ! I will do it! 😀

Hi,

Thank you @chanwit for your support over the last two days; I couldn't have fixed this without it!

  1. After upgrading my lab environment, I saw an error like the one below from some Terraform objects with dots in their names. Basically, the tf-controller couldn't create the runner pod; I don't know whether the constraint comes from Kubernetes or from the controller. My workaround was to delete those jobs and recreate them with fixed names (see the sketch after the error below), which lets the tf-controller create the runner pod, since the runner pod's name is derived from the name of the Terraform object. May I suggest making the runner pod name configurable via Helm, to decouple it from the name of the Terraform object? I didn't find this in your Helm chart, but maybe I missed it. In any case, it would help people like me who use dots in the names of their Terraform objects.

{"level":"error","ts":"2023-10-30T12:57:35.362Z","msg":"Reconciler error","controller":"terraform","controllerGroup":"infra.contrib.fluxcd.io","controllerKind":"Terraform","Terraform":{"name":"zone-vols.cat","namespace":"fluxcd-cloudflare"},"namespace":"fluxcd-cloudflare","name":"zone-vols.cat","reconcileID":"b630a85f-bdff-4c7c-901d-50ccc3e21812","error":"Pod "zone-vols.cat-tf-runner" is invalid: spec.hostname: Invalid value: "zone-vols.cat": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc', regex used for validation is 'a-z0-9?')"}

  2. I think (though I'm not sure) we should update the doc https://weaveworks.github.io/tf-controller/getting_started/#installation-on-gke to specify the namespaces with the Helm value Values.runner.serviceAccount.allowedNamespaces instead of Values.runner.allowedNamespaces, which is what the documentation currently suggests. I deployed the tf-controller with Terraform and Helm, and my code only works with Values.runner.serviceAccount.allowedNamespaces. I found it here
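
For reference, the values fragment that worked for me looks roughly like this (the namespace list is just an example):

# values.yaml fragment: note the serviceAccount level in the path
runner:
  serviceAccount:
    allowedNamespaces:
      - fluxcd-cloudflare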