flux-iac/tofu-controller

Terraform CRs stuck in Initializing status

Opened this issue · 1 comments

Running v0.015.1 version.

Problem

  • We see that couple CRs are stuck in Initializing state. The only way so far we've found so far to reconcile these stuck CRs is by restarting tf-controller pods. (We see this even when CPU and memory utilization on nodes is not too high)
  • When it is stuck in Initializing state, we see no logs from tf-controller on these CRs which indicates they are not being reconciled
  • tf- controller concurrency is at 192 and qps and burst are below
 kubeAPIBurst: 1000
 kubeAPIQPS: 500
  • Are there any metrics we should be looking to understand the load on the controller why this might be happening and what we can do to alleviate the problem ?

We have the formula to approximate the capacity here which would help:

#281