mittwald/kubernetes-replicator

kubernetes-replicator pod crashes when updating secrets

sravanakinapally opened this issue · 9 comments

Describe the bug
The kubernetes-replicator pod crashes when replicating a single secret across 150+ namespaces

To Reproduce
Installed using helm

          helm repo add mittwald https://helm.mittwald.de
          helm upgrade --version v2.6.3 --install kubernetes-replicator mittwald/kubernetes-replicator --namespace kubernetes-replicator

Expected behavior
The pod should not crash, so that all secrets are replicated to all namespaces

Environment:

  • Kubernetes version: [1.22]
  • kubernetes-replicator version: [2.6.3]

Additional context
Additional details about pod termination

  • Increased the pod resource quotas, but the pod still crashed
  • Increased the replica count to 3, but all 3 pods crashed
  • Updated to the latest replicator version, 2.7.3, but the pod still crashed
      terminated:
        exitCode: 2
        finishedAt: "2022-09-13T21:23:52Z"
        reason: Error
        startedAt: "2022-09-13T21:14:14Z"
    name: kubernetes-replicator
    ready: false
    restartCount: 1

This is the message after the pod restart:

kubernetes-replicator-768465d6d7-4mx78:kubernetes-replicator time="2022-09-13T21:24:09Z" level=error msg="could not replicate object to other namespaces" error="Replicated kubernetes-replicator/xyz.com.registry.creds to 70 out of 155 namespaces

There has not been any activity to this issue in the last 14 days. It will automatically be closed after 7 more days. Remove the stale label to prevent this.

Any update is appreciated

Apologies for the delay. Do you have any logs available from when the controller crashed? Those would help isolate the issue and determine whether this is the same issue as #214.

These are the logs from the pod:

level=error msg="could not replicate object to other namespaces" error="Replicated kubernetes-replicator/xxxxxxxx.xxxxx.creds to 157 out of 178 namespaces: 21 errors occurred:\n\t*

and the pod crashes with this error:

    lastState:
      terminated:
        containerID: containerd://63cbde6dd2d3a1659ce116e83c9545312d1482024c1548704aab755f3dac6313
        exitCode: 2
        finishedAt: "2022-10-13T20:20:54Z"
        reason: Error
        startedAt: "2022-10-13T20:14:55Z"

After further debugging, the container in the pod is killed due to a liveness probe failure. After changing periodSeconds from 10 to 60 on both the liveness and readiness probes, the container is no longer killed and there are no replication failures. That said, the container is only killed when the replicator starts replicating the secrets and the probes fail, so something is blocking the probe requests while secrets are being replicated.
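For reference, here is a minimal sketch of that probe change applied directly to the Deployment; the namespace, deployment name, and container index are assumptions based on the Helm install command above, so adjust them to your cluster (the same values may also be settable through the chart):

    kubectl -n kubernetes-replicator patch deployment kubernetes-replicator \
      --type=json \
      -p='[
        {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/periodSeconds", "value": 60},
        {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/periodSeconds", "value": 60}
      ]'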

There has not been any activity to this issue in the last 14 days. It will automatically be closed after 7 more days. Remove the stale label to prevent this.

I am experiencing this same issue.

It's from right here. If the status is not "synced" for all resources, then it reports unhealthy. This shouldn't be the case for a liveness probe: a sync that has not yet completed doesn't mean the application should report unhealthy.
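To illustrate the pattern being described, here is a minimal, hypothetical Go sketch of a liveness handler that fails while any resource is still syncing; the names (replicator, Synced, /healthz, the port) are illustrative assumptions and not the project's actual code:

    // Hypothetical sketch of the liveness-check pattern discussed above;
    // names are illustrative, not taken from the project's source.
    package main

    import (
        "log"
        "net/http"
    )

    // replicator stands in for a controller that knows whether its
    // current sync pass has completed.
    type replicator interface {
        Synced() bool
    }

    // livenessHandler returns 500 as long as any replicator is still syncing.
    // For a liveness probe this is too strict: a long-running sync makes the
    // probe fail, and the kubelet restarts an otherwise healthy process.
    func livenessHandler(reps []replicator) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            for _, rep := range reps {
                if !rep.Synced() {
                    http.Error(w, "not synced", http.StatusInternalServerError)
                    return
                }
            }
            w.WriteHeader(http.StatusOK)
        }
    }

    func main() {
        // Path and port are arbitrary for the sketch.
        http.HandleFunc("/healthz", livenessHandler(nil))
        log.Fatal(http.ListenAndServe(":8080", nil))
    }

An in-progress sync would more naturally fail a readiness check, or nothing at all, with liveness failure reserved for a genuinely stuck process; that would keep the kubelet from killing the pod mid-replication.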

We're still experiencing this issue; is there an update? I noticed the PR check failed when spinning up a KIND cluster, can we rerun it?