Secrets Manager Stops Syncing
m477r1x opened this issue · 3 comments
Hi, I was wondering if there is any advice on a weird issue we are seeing. I have used external-secrets in other jobs and I've never seen instability like this before, so it must be something to do with the setup here, but I'm not sure what to look for.
Randomly, the external-secrets pod will simply stop syncing secrets. There isn't always an error in the logs, but if you run kubectl logs -f <podname> you can see the log output isn't moving at all. If I delete the pod and let the cluster spin up a new one, things kick back into gear again. I did some digging and found this error on one of the external-secrets deployments:
{"level":30,"message_time":"2021-06-17T09:24:10.720Z","pid":17,"hostname":"external-secrets-fd64d899d-qcr8w","msg":"starting poller for monitoring/prometheus-alertmanager-config"}
failed to watch file "/var/lib/docker/containers/8a6570089510d5eb9d1d8e79365fde1cfaa6f18c20f444466be3220311cc86e4/8a6570089510d5eb9d1d8e79365fde1cfaa6f18c20f444466be3220311cc86e4-json.log": no space left on device
However, at the time the secrets did seem to be syncing OK: checking a random secret in a namespace with kubectl get externalsecrets, I could see the last sync was 10s ago and the status was SUCCESS. When it gets stuck, the sync status field is blank.
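For reference, this is roughly the check-and-restart routine (the deployment and pod names below match our setup in kube-system, but treat them as placeholders):

# watch the controller logs; when it's stuck the output stops completely
kubectl -n kube-system logs -f deploy/external-secrets

# check whether ExternalSecrets are still being reconciled
# (healthy entries show STATUS=SUCCESS and a recent LAST SYNC; stuck ones are blank)
kubectl get externalsecrets --all-namespaces

# workaround: delete the pod and let the Deployment spin up a new one
kubectl -n kube-system delete pod <external-secrets-pod-name>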
Following on from the error above, which appears to be about disk space, I checked the space on the pod, which was in the 80-90% range, so not completely full. I then checked the actual node the pod was running on (details below), but long story short, I couldn't see any disk space issues there.
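For context, the space checks were along these lines (exact invocations approximate):

# disk usage from inside the external-secrets pod
kubectl -n kube-system exec -it <external-secrets-pod-name> -- df -h

# and on the node itself (over SSH/SSM): root filesystem plus the docker dir the error points at
df -h / /var/lib/docker
df -i /

# node conditions and capacity (output pasted below)
kubectl describe node ip-192-168-89-192.eu-west-1.compute.internal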
Node Details
Name: ip-192-168-89-192.eu-west-1.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=c5.2xlarge
beta.kubernetes.io/os=linux
eks.amazonaws.com/capacityType=ON_DEMAND
eks.amazonaws.com/nodegroup=k8s-staging-private-5
eks.amazonaws.com/nodegroup-image=ami-0313d49570831d7f4
failure-domain.beta.kubernetes.io/region=eu-west-1
failure-domain.beta.kubernetes.io/zone=eu-west-1c
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-192-168-89-192.eu-west-1.compute.internal
kubernetes.io/os=linux
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 28 Oct 2020 15:21:27 +0000
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: ip-192-168-89-192.eu-west-1.compute.internal
AcquireTime: <unset>
RenewTime: Thu, 17 Jun 2021 10:32:15 +0100
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Thu, 17 Jun 2021 10:31:23 +0100 Wed, 28 Oct 2020 15:21:23 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 17 Jun 2021 10:31:23 +0100 Wed, 28 Oct 2020 15:21:23 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 17 Jun 2021 10:31:23 +0100 Wed, 28 Oct 2020 15:21:23 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 17 Jun 2021 10:31:23 +0100 Wed, 28 Oct 2020 15:22:48 +0000 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.89.192
Hostname: ip-192-168-89-192.eu-west-1.compute.internal
InternalDNS: ip-192-168-89-192.eu-west-1.compute.internal
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 8
ephemeral-storage: 83873772Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 15834764Ki
pods: 58
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 7910m
ephemeral-storage: 76224326324
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 14817932Ki
pods: 58
System Info:
Machine ID: ec24954aed146321bbe40eaa8886ada1
System UUID: EC24954A-ED14-6321-BBE4-0EAA8886ADA1
Boot ID: 896c34f7-d65d-4a2a-bfe7-bc7d374082ec
Kernel Version: 4.14.198-152.320.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.6
Kubelet Version: v1.16.13-eks-ec92d4
Kube-Proxy Version: v1.16.13-eks-ec92d4
ProviderID: aws:///eu-west-1c/i-0f58d71f8cdd4e211
Non-terminated Pods: (37 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
airflow airflow-web-864d99f549-hcnzc 100m (1%) 2 (25%) 128Mi (0%) 1Gi (7%) 42d
bolt bolt-dev-698c4b7766-98gbh 150m (1%) 2050m (25%) 160Mi (1%) 1056Mi (7%) 23d
cx cruella-78b5978788-5mglv 200m (2%) 2200m (27%) 268Mi (1%) 1164Mi (8%) 23d
cx merida-7fbdc8b49f-88n9m 200m (2%) 2200m (27%) 384Mi (2%) 1280Mi (8%) 41h
cx portal-api-59d49d7dd6-j5pvf 200m (2%) 2200m (27%) 268Mi (1%) 1164Mi (8%) 42d
cx portal-api-59d49d7dd6-pvp9x 200m (2%) 2200m (27%) 268Mi (1%) 1164Mi (8%) 70d
dar ppa-billing-worker-86c77d8bfb-mr9tp 350m (4%) 2500m (31%) 2176Mi (15%) 3Gi (21%) 8d
flux flux-helm-operator-5c7f4d899c-qjqkv 100m (1%) 100m (1%) 512Mi (3%) 512Mi (3%) 42d
flux memcached-2 50m (0%) 250m (3%) 256Mi (1%) 256Mi (1%) 46d
helios baseload-model-0-regressor-6485f47b9f-jzgpm 200m (2%) 2 (25%) 128Mi (0%) 1Gi (7%) 156d
helios ppa-forecast-676d4d8f6d-fkqsp 300m (3%) 2200m (27%) 384Mi (2%) 1280Mi (8%) 21d
http-headers http-headers-c765dfdc5-gplzh 120m (1%) 2020m (25%) 192Mi (1%) 1088Mi (7%) 231d
infrastructure asset-control-5c4c6c5c68-zmsb5 150m (1%) 2100m (26%) 192Mi (1%) 1088Mi (7%) 9d
infrastructure asset-control-vis-frontend-5b85f85695-9xqmf 100m (1%) 2 (25%) 128Mi (0%) 1Gi (7%) 113d
istio-system istio-citadel-6bc66499fb-5k7v2 10m (0%) 0 (0%) 0 (0%) 0 (0%) 23d
istio-system istio-galley-7889cdf457-z6vq5 10m (0%) 0 (0%) 0 (0%) 0 (0%) 23d
istio-system istio-ingressgateway-internal-c669f4dfb-v8czn 200m (2%) 4 (50%) 256Mi (1%) 2Gi (14%) 42d
istio-system istio-ingressgateway-internal-secure-f44d8878d-9pfw7 200m (2%) 4 (50%) 256Mi (1%) 2Gi (14%) 204d
kiali-operator kiali-operator-74c7ff6788-xtqr5 100m (1%) 2 (25%) 128Mi (0%) 1Gi (7%) 42d
kube-system aws-node-xwhr4 10m (0%) 0 (0%) 0 (0%) 0 (0%) 56d
kube-system coredns-89649b947-5zdfn 100m (1%) 0 (0%) 70Mi (0%) 170Mi (1%) 156d
kube-system external-secrets-fd64d899d-qcr8w 0 (0%) 0 (0%) 0 (0%) 0 (0%) 8d
kube-system kube-proxy-nwlvn 100m (1%) 0 (0%) 0 (0%) 0 (0%) 231d
kube-system kube2iam-8rgg8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 231d
kube-system tiller-deploy-59fc686959-8js7q 500m (6%) 1 (12%) 512Mi (3%) 512Mi (3%) 231d
kubernetes-dashboard dashboard-metrics-scraper-76679bc5b9-5g2p7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 42d
limejump-api-gateway jank-file-downloader-64f9fc788-qtbnc 150m (1%) 2050m (25%) 160Mi (1%) 1056Mi (7%) 42d
logging host-messages-fluentbit-bfkld 50m (0%) 50m (0%) 64Mi (0%) 64Mi (0%) 51d
logging k8s-fluentbit-nn2vk 100m (1%) 100m (1%) 128Mi (0%) 128Mi (0%) 56d
monitoring jaeger-oauth2-proxy-685d57cb9c-jjg9l 100m (1%) 2 (25%) 128Mi (0%) 1Gi (7%) 231d
monitoring jaeger-operator-7fc89cb645-p75fn 100m (1%) 2 (25%) 128Mi (0%) 1Gi (7%) 231d
monitoring prometheus-blackbox-exporter-5dfd45f65c-22kvc 100m (1%) 2 (25%) 128Mi (0%) 1Gi (7%) 42d
monitoring prometheus-node-exporter-ghws7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 231d
polaris polaris-proxy-oauth2-proxy-76869b85b7-tn9w9 0 (0%) 0 (0%) 0 (0%) 0 (0%) 42d
trading-ci trading-tradingdb-0 250m (3%) 0 (0%) 256Mi (1%) 0 (0%) 6h15m
trading operation-bot-consumer-686f8cffdd-cztzq 200m (2%) 2150m (27%) 256Mi (1%) 1152Mi (7%) 45h
trading trading-dev-7dd47c6b95-66r6z 300m (3%) 2300m (29%) 384Mi (2%) 1280Mi (8%) 38h
I also checked the inode usage, though I must admit I'm not very clued up on troubleshooting issues caused by inode allocation:
Inode info
Filesystem Inodes Used Available Use% Mounted on
overlay 33.3M 4.5M 28.7M 14% /
tmpfs 1.9M 17 1.9M 0% /dev
tmpfs 1.9M 16 1.9M 0% /sys/fs/cgroup
/dev/nvme0n1p1 33.3M 4.5M 28.7M 14% /dev/termination-log
/dev/nvme0n1p1 33.3M 4.5M 28.7M 14% /etc/resolv.conf
/dev/nvme0n1p1 33.3M 4.5M 28.7M 14% /etc/hostname
/dev/nvme0n1p1 33.3M 4.5M 28.7M 14% /etc/hosts
shm 1.9M 1 1.9M 0% /dev/shm
tmpfs 1.9M 9 1.9M 0% /run/secrets/kubernetes.io/serviceaccount
tmpfs 1.9M 1 1.9M 0% /proc/acpi
tmpfs 1.9M 17 1.9M 0% /proc/kcore
tmpfs 1.9M 17 1.9M 0% /proc/keys
tmpfs 1.9M 17 1.9M 0% /proc/latency_stats
tmpfs 1.9M 17 1.9M 0% /proc/timer_list
tmpfs 1.9M 17 1.9M 0% /proc/sched_debug
tmpfs 1.9M 1 1.9M 0% /sys/firmware
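One more thought, which I haven't confirmed here: as far as I understand it, a "no space left on device" error from a file watch can also mean the kernel's inotify limits have been hit rather than the disk or inodes being full, so the inotify limits on the node might be worth checking too (standard sysctl names):

# current inotify limits on the node
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances

# count inotify instances (watch fds) currently open across all processes
find /proc/*/fd -lname anon_inode:inotify 2>/dev/null | wc -l

# temporarily raise the limits to see if the stalls stop
sudo sysctl -w fs.inotify.max_user_watches=524288
sudo sysctl -w fs.inotify.max_user_instances=512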
Any advice would be appreciated!
Might be related to #763, depending on which version you run.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.
This issue was closed because it has been stalled for 30 days with no activity.