flux-iac/tofu-controller

Tf-runner falls into CrashLoopBackOff state because of unknown flag --grpc-port for tofu-controller command

eneiss opened this issue · 2 comments

Hello!

I'm having an issue with the tf-runner Pod created by tf-controller (version v0.16.0-rc.4).
I deployed the tf-controller using Flux as a HelmRelease, as specified in the docs.
It may be worth mentioning that all my Helm/Docker registries are internal mirrors, but they all serve the public v0.16.0-rc.4 version of the tf-controller Helm chart and Docker images.

When I create a "tf-test" Terraform resource in the git repository targeted by a GitRepository resource inside my cluster hosting Flux, the tf-controller running on it creates a "tf-test-tf-runner" Pod. This Pod then goes into an Error/CrashLoopBackOff state with an "unknown flag: --grpc-port" error (full error log below).

It seems like the tf-controller is creating a runner Pod with an incorrect CLI flag on the tofu-controller command, which is explicitly specified as an arg of the tf-runner container (see details below).
Unfortunately I did not find any value to override in the tf-controller Helm chart to prevent this behavior.

Please let me know if I missed something (I'm still new to Flux) or if you need additional details, and thank you for your time :)

Additional information:

tf-test-tf-runner logs with the error:

unknown flag: --grpc-port
Usage of tofu-controller:
      --allow-break-the-glass                     Allow break the glass mode.
      --allow-cross-namespace-refs                Enable following cross-namespace references. Overrides --no-cross-namespace-
      --ca-cert-validity-duration duration        The duration that the ca certificate certificates should be valid for. Defau
      --cert-rotation-check-frequency duration    The interval that the mTLS certificate rotator should check the certificate
      --cert-validity-duration duration           (Deprecated) The duration that the mTLS certificate that the runner pod shou
      --cluster-domain string                     The cluster domain used by the cluster. (default "cluster.local")
      --concurrent int                            The number of concurrent terraform reconciles. (default 4)
      --enable-leader-election                    Enable leader election for controller manager. Enabling this will ensure the
      --events-addr string                        The address of the events receiver.
      --health-addr string                        The address the health endpoint binds to. (default ":9440")
      --http-retry int                            The maximum number of retries when failing to fetch artifacts over HTTP. (de
      --kube-api-burst int                        The maximum burst queries-per-second of requests sent to the Kubernetes API.
      --kube-api-qps float32                      The maximum queries-per-second of requests sent to the Kubernetes API. (defa
      --leader-election-lease-duration duration   Interval at which non-leader candidates will wait to force acquire leadershi
      --leader-election-release-on-cancel         Defines if the leader should step down voluntarily on controller manager shu
      --leader-election-renew-deadline duration   Duration that the leading controller manager will retry refreshing leadershi
      --leader-election-retry-period duration     Duration the LeaderElector clients should wait between tries of actions (dur
      --log-encoding string                       Log encoding format. Can be 'json' or 'console'. (default "json")
      --log-level string                          Log verbosity level. Can be one of 'trace', 'debug', 'info', 'error'. (defau
      --metrics-addr string                       The address the metric endpoint binds to. (default ":8080")
      --no-cross-namespace-refs                   When set to true, references between custom resources are allowed only if th
      --requeue-dependency duration               The interval at which failing dependencies are reevaluated. (default 30s)
      --runner-creation-timeout duration          Timeout for creating a runner pod. (default 2m0s)
      --runner-grpc-max-message-size int          The maximum message size for gRPC connections in MiB. (default 4)
      --runner-grpc-port int                      The port which will be exposed on the runner pod for gRPC connections. (defa
      --use-pod-subdomain-resolution              Allow to use pod hostname/subdomain DNS resolution instead of IP based
      --watch-all-namespaces                      Watch for custom resources in all namespaces, if set to false it will only w
unknown flag: --grpc-port

tf-test-tf-runner Pod description:

Name:             tf-test-tf-runner
Namespace:        flux-system
Priority:         0
Service Account:  tf-runner
[...]
Labels:           app.kubernetes.io/created-by=tf-controller
                  app.kubernetes.io/instance=tf-runner-c81aeb3f
                  app.kubernetes.io/name=tf-runner
                  infra.contrib.fluxcd.io/terraform=flux-system
                  tf.weave.works/tls-secret-name=terraform-runner.tls-1717010935
[...]
Containers:
  tf-runner:
    Container ID:    containerd://9b449ac3c8b376b15504a2fbc0176b9ee26a717fb30da858c2c6c770ac775731
    Image:           [INTERNAL-REGISTRY]/flux-iac/tofu-controller:v0.16.0-rc.4
    Image ID:        [INTERNAL-REGISTRY]/flux-iac/tofu-controller@sha256:850888287bdf3429a8d20e791c74356d4b8210041227c26a70d40b51c0abdf79
    Port:            30000/TCP
    Host Port:       0/TCP
    SeccompProfile:  RuntimeDefault
    Args:
      --grpc-port
      30000
      --tls-secret-name
      terraform-runner.tls-1717010935
      --grpc-max-message-size
      4
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Tue, 28 May 2024 20:05:46 +0000
      Finished:     Tue, 28 May 2024 20:05:46 +0000

Versions of CNI, Flux and tf-controller

$ helm list -A
NAME                            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                        APP VERSION
cilium                          kube-system     103             2024-05-28 20:02:09.914929235 +0000 UTC deployed        cilium-1.15.5                1.15.5
flux                            flux-system     131             2024-05-28 20:02:17.205621316 +0000 UTC deployed        flux2-2.13.0                 2.3.0
flux-system-tf-controller       flux-system     2               2024-05-28 19:28:06.002703723 +0000 UTC deployed        tf-controller-v0.16.0-rc.4   v0.16.0-rc.4

User-supplied values of the tf-controller Helm chart deployed in my cluster:

allowBreakTheGlass: true
awsPackage:
  install: false
caCertValidityDuration: 24h
certRotationCheckFrequency: 30m
concurrency: 8
image:
  repository: [INTERNAL-REGISTRY]/flux-iac/tofu-controller
  tag: v0.16.0-rc.4
replicaCount: 1
resources:
  limits:
    cpu: 1000m
    memory: 2Gi
  requests:
    cpu: 400m
    memory: 64Mi
runner:
  image:
    repository: [INTERNAL-REGISTRY]/flux-iac/tofu-controller
    tag: v0.16.0-rc.4

Hello @eneiss,

Thank you for the detailed issue, it helps a lot! It seems that you are using the tofu-controller image for the runner when you should be using the tf-runner image.

Change your values to:

...
runner:
  image:
    repository: [INTERNAL-REGISTRY]/flux-iac/tf-runner
    tag: v0.16.0-rc.4
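
Once the HelmRelease has reconciled, one way to confirm that the runner Pod picked up the right image is to look at the image reference on the Pod itself. The snippet below is only a sketch: check_runner_image is a hypothetical helper (not part of tofu-controller), and the Pod name and namespace in the commented-out kubectl call are taken from the report above, so adjust them to your setup.

```shell
#!/bin/sh
# Hypothetical helper: flags an image reference whose repository name is the
# controller image (tofu-controller), as in the misconfiguration above.
check_runner_image() {
  image="$1"
  # Drop any digest suffix (@sha256:...).
  name="${image%%@*}"
  # Keep only the last path component, e.g. "tofu-controller:v0.16.0-rc.4".
  name="${name##*/}"
  # Drop the tag.
  name="${name%%:*}"
  if [ "$name" = "tofu-controller" ]; then
    echo "runner is using the controller image: $image"
    return 1
  fi
  echo "runner image looks fine: $image"
  return 0
}

# Example against a live cluster (assumes the Pod name and namespace from
# this issue; left commented out because it needs cluster access):
# check_runner_image "$(kubectl get pod tf-test-tf-runner -n flux-system \
#   -o jsonpath='{.spec.containers[0].image}')"
```

If the helper reports the controller image, the runner.image.repository value is still pointing at tofu-controller and the Pod will keep failing on the --grpc-port flag.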

Please let me know if you are still experiencing issues :)

Oops, nice catch! Thanks a lot for the help :)