flux-iac/tofu-controller

Terraform jobs failing with status "Reconciliation in progress" on private GKE cluster

Closed this issue · 6 comments

Today, my lab environment on a private GKE cluster began experiencing a problem. All 30+ Terraform jobs in the cluster are failing with the same error, even though we didn't make any changes.

Here are the details:

tf-controller: v0.15.0
Kubernetes: v1.24.15-gke.1700
Flux: v2.0.0-rc.5
Weave GitOps: v0.26.0

All my Terraform jobs are failing with the status "Reconciliation in progress." I have tried the following troubleshooting steps:

  • Deleted the tf-controller pod to force a restart: no change
  • Suspended all jobs and ran only one: no change
  • Enabled trace log level in the tf-controller pod: logs attached
    tf-controller-trace-loglevel.tar.gz
  • Ran break-the-glass in this pod: no change

I have also tried to inspect the output of the runner execution with logging enabled by setting the following variables:
DISABLE_TF_LOGS=0
ENABLE_SENSITIVE_TF_LOGS="1"
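
A minimal sketch of one place these can be set, assuming the installed Terraform CRD exposes spec.runnerPodTemplate.spec.env (everything outside the two variables is illustrative):

# Fragment of a Terraform object's spec (illustrative; the field path is an assumption)
spec:
  runnerPodTemplate:
    spec:
      env:
        - name: DISABLE_TF_LOGS
          value: "0"
        - name: ENABLE_SENSITIVE_TF_LOGS
          value: "1"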

However, the runner does not start, and I do not know how to access its logs to learn more about this issue.

Error from server (BadRequest): container "tf-runner" in pod "zone-vols.cat-tf-runner" is waiting to start: ContainerCreating
2023/10/26 16:56:34 Starting the runner... version  sha 
Error from server (NotFound): pods "zone-vols.cat-tf-runner" not found
Error from server (NotFound): pods "zone-vols.cat-tf-runner" not found
Error from server (BadRequest): container "tf-runner" in pod "zone-vols.cat-tf-runner" is waiting to start: ContainerCreating
Error from server (BadRequest): container "tf-runner" in pod "zone-vols.cat-tf-runner" is waiting to start: ContainerCreating
Error from server (BadRequest): container "tf-runner" in pod "zone-vols.cat-tf-runner" is waiting to start: ContainerCreating
2023/10/26 16:58:56 Starting the runner... version  sha 
rpc error: code = NotFound desc = an error occurred when try to find container "6f75c5325dbd5a38d0b074fe29beb1d3f811d67fc598eebeb715eda574a65754": not found
Error from server (NotFound): pods "zone-vols.cat-tf-runner" not found
Error from server (NotFound): pods "zone-vols.cat-tf-runner" not found
Error from server (NotFound): pods "zone-vols.cat-tf-runner" not found
Error from server (BadRequest): container "tf-runner" in pod "zone-vols.cat-tf-runner" is waiting to start: ContainerCreating
Error from server (BadRequest): container "tf-runner" in pod "zone-vols.cat-tf-runner" is waiting to start: ContainerCreating
Error from server (BadRequest): container "tf-runner" in pod "zone-vols.cat-tf-runner" is waiting to start: ContainerCreating
2023/10/26 17:05:24 Starting the runner... version  sha 
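
For completeness, a sketch of the standard kubectl checks for a pod stuck in ContainerCreating (the pod name is taken from the logs above; the namespace placeholder is whichever namespace the Terraform object lives in):

# Events on the stuck pod usually show why the container cannot be created
kubectl -n <namespace> describe pod zone-vols.cat-tf-runner

# Recent events in the namespace, oldest first
kubectl -n <namespace> get events --sort-by=.metadata.creationTimestamp

# Runner container logs; --previous shows the prior attempt if the container restarted
kubectl -n <namespace> logs zone-vols.cat-tf-runner -c tf-runner --previous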

Could you please help me troubleshoot this issue?

I have had this issue in the past as well. Some of the things I have had to do (sketched below in kubectl terms) are:

  1. Delete the tf-controller pod and the tf-runner pods
  2. Suspend/restart the tf resource
  3. A combination of the above
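
Roughly, in kubectl terms (names in angle brackets are placeholders; spec.suspend on the Terraform object is assumed to behave like the suspend field on other Flux resources):

# 1. Delete the controller pod and the stuck runner pod so they get recreated
kubectl -n flux-system delete pod <tf-controller-pod-name>
kubectl -n <namespace> delete pod <terraform-name>-tf-runner

# 2. Suspend and then resume the Terraform resource to retrigger reconciliation
kubectl -n <namespace> patch terraform <terraform-name> --type=merge -p '{"spec":{"suspend":true}}'
kubectl -n <namespace> patch terraform <terraform-name> --type=merge -p '{"spec":{"suspend":false}}'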

Unfortunately, like you, I find it frustrating when the controller thinks it is in progress but the runner pod seems to be waiting for something.

GKE Cloud DNS resolution support has been released in v0.16.0-rc.3. Please upgrade and refer to the docs here: https://weaveworks.github.io/tf-controller/getting_started/#installation-on-gke
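
A minimal sketch of that upgrade, assuming the chart was installed from the tf-controller Helm repository as in the getting-started docs (release name, namespace, and values file are assumptions; adjust to your setup):

helm repo add tf-controller https://weaveworks.github.io/tf-controller/
helm repo update
helm upgrade -n flux-system tf-controller tf-controller/tf-controller \
  --version <chart version that packages v0.16.0-rc.3> \
  -f values.yaml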

Oh, thank you @chanwit! Good point, let me try on Monday and I will apply your fix. Do you know if there are any breaking changes? I think this upgrade will fix my environment. I will also upgrade Flux2 to 2.1.0. On the other hand, do you have any plans to release 0.16.0 or 0.16.0-rc.4 soon?

Yes, there is a breaking change in v0.16.0-rc.3. Please refer to the change log.

thank you @chanwit ! I will do it! 😀

Hi,

Thank you @chanwit for your support over the last two days; I couldn't have fixed this without it!

  1. After upgrading my lab environment, I saw an error like the one below from some Terraform objects with dots in their names. Basically, the tf-controller couldn't create the runner pod; I don't know whether the constraint comes from Kubernetes or from the controller. My workaround was to delete those jobs and recreate them with fixed names (see the sketch after the error below), which lets the tf-controller create the runner pod, since the runner pod's name is derived from the name of the Terraform object. May I suggest making the runner pod name configurable via Helm, to decouple it from the name of the Terraform object? I didn't find this in your Helm chart, but maybe I missed it. In any case, it would help people like me who use dots in the names of their Terraform objects.

{"level":"error","ts":"2023-10-30T12:57:35.362Z","msg":"Reconciler error","controller":"terraform","controllerGroup":"infra.contrib.fluxcd.io","controllerKind":"Terraform","Terraform":{"name":"zone-vols.cat","namespace":"fluxcd-cloudflare"},"namespace":"fluxcd-cloudflare","name":"zone-vols.cat","reconcileID":"b630a85f-bdff-4c7c-901d-50ccc3e21812","error":"Pod "zone-vols.cat-tf-runner" is invalid: spec.hostname: Invalid value: "zone-vols.cat": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc', regex used for validation is 'a-z0-9?')"}

  2. I think (though I'm not sure) we should update the doc https://weaveworks.github.io/tf-controller/getting_started/#installation-on-gke to specify the namespaces with the Helm value Values.runner.serviceAccount.allowedNamespaces instead of Values.runner.allowedNamespaces, which is what the documentation currently suggests. I deployed the tf-controller with Terraform and Helm, and my code only works with Values.runner.serviceAccount.allowedNamespaces. I found it here
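
For reference, the values fragment that worked for me looks roughly like this (the namespace list is just an example):

# values.yaml fragment: note the serviceAccount level in the path
runner:
  serviceAccount:
    allowedNamespaces:
      - fluxcd-cloudflare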