Secrets Manager Stops Syncing
m477r1x opened this issue · 3 comments
Hi, I was wondering if there is any advice on a weird issue we are seeing. I have used external-secrets in other jobs and I've never seen instability like this before, so it must be something to do with the setup here, but I'm not sure what to look for.
Randomly, the external-secrets pod will simply stop syncing secrets. There isn't always an error in the logs, but if you run kubectl logs -f <podname> you can see the log output isn't moving at all. If I delete the pod and let the cluster spin up a new one, things kick back into gear again. I did some digging and found this error on one of the external-secrets deployments:
{"level":30,"message_time":"2021-06-17T09:24:10.720Z","pid":17,"hostname":"external-secrets-fd64d899d-qcr8w","msg":"starting poller for monitoring/prometheus-alertmanager-config"}
failed to watch file "/var/lib/docker/containers/8a6570089510d5eb9d1d8e79365fde1cfaa6f18c20f444466be3220311cc86e4/8a6570089510d5eb9d1d8e79365fde1cfaa6f18c20f444466be3220311cc86e4-json.log": no space left on device
However, at the time the secrets did seem to be syncing OK: checking a random secret in a namespace with kubectl get externalsecrets, I could see the last sync was 10s ago and the status was SUCCESS. When it gets stuck, the sync status field is blank.
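For reference, this is roughly the check-and-restart routine (the deployment and pod names below match our setup in kube-system, but treat them as placeholders):

# watch the controller logs; when it's stuck the output stops completely
kubectl -n kube-system logs -f deploy/external-secrets

# check whether ExternalSecrets are still being reconciled
# (healthy entries show STATUS=SUCCESS and a recent LAST SYNC; stuck ones are blank)
kubectl get externalsecrets --all-namespaces

# workaround: delete the pod and let the Deployment spin up a new one
kubectl -n kube-system delete pod <external-secrets-pod-name>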
Following on from the error above, which appears to be about disk space, I checked the space on the pod, which was in the 80-90% range, so not completely full. I then checked the actual node the pod was running on (details below), but long story short, I couldn't see any disk space issues there.
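For context, the space checks were along these lines (exact invocations approximate):

# disk usage from inside the external-secrets pod
kubectl -n kube-system exec -it <external-secrets-pod-name> -- df -h

# and on the node itself (over SSH/SSM): root filesystem plus the docker dir the error points at
df -h / /var/lib/docker
df -i /

# node conditions and capacity (output pasted below)
kubectl describe node ip-192-168-89-192.eu-west-1.compute.internal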
Node Details
Name: ip-192-168-89-192.eu-west-1.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=c5.2xlarge
beta.kubernetes.io/os=linux
eks.amazonaws.com/capacityType=ON_DEMAND
eks.amazonaws.com/nodegroup=k8s-staging-private-5
eks.amazonaws.com/nodegroup-image=ami-0313d49570831d7f4
failure-domain.beta.kubernetes.io/region=eu-west-1
failure-domain.beta.kubernetes.io/zone=eu-west-1c
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-192-168-89-192.eu-west-1.compute.internal
kubernetes.io/os=linux
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 28 Oct 2020 15:21:27 +0000
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: ip-192-168-89-192.eu-west-1.compute.internal
AcquireTime: <unset>
RenewTime: Thu, 17 Jun 2021 10:32:15 +0100
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Thu, 17 Jun 2021 10:31:23 +0100 Wed, 28 Oct 2020 15:21:23 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 17 Jun 2021 10:31:23 +0100 Wed, 28 Oct 2020 15:21:23 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 17 Jun 2021 10:31:23 +0100 Wed, 28 Oct 2020 15:21:23 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 17 Jun 2021 10:31:23 +0100 Wed, 28 Oct 2020 15:22:48 +0000 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.89.192
Hostname: ip-192-168-89-192.eu-west-1.compute.internal
InternalDNS: ip-192-168-89-192.eu-west-1.compute.internal
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 8
ephemeral-storage: 83873772Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 15834764Ki
pods: 58
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 7910m
ephemeral-storage: 76224326324
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 14817932Ki
pods: 58
System Info:
Machine ID: ec24954aed146321bbe40eaa8886ada1
System UUID: EC24954A-ED14-6321-BBE4-0EAA8886ADA1
Boot ID: 896c34f7-d65d-4a2a-bfe7-bc7d374082ec
Kernel Version: 4.14.198-152.320.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.6
Kubelet Version: v1.16.13-eks-ec92d4
Kube-Proxy Version: v1.16.13-eks-ec92d4
ProviderID: aws:///eu-west-1c/i-0f58d71f8cdd4e211
Non-terminated Pods: (37 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
airflow airflow-web-864d99f549-hcnzc 100m (1%) 2 (25%) 128Mi (0%) 1Gi (7%) 42d
bolt bolt-dev-698c4b7766-98gbh 150m (1%) 2050m (25%) 160Mi (1%) 1056Mi (7%) 23d
cx cruella-78b5978788-5mglv 200m (2%) 2200m (27%) 268Mi (1%) 1164Mi (8%) 23d
cx merida-7fbdc8b49f-88n9m 200m (2%) 2200m (27%) 384Mi (2%) 1280Mi (8%) 41h
cx portal-api-59d49d7dd6-j5pvf 200m (2%) 2200m (27%) 268Mi (1%) 1164Mi (8%) 42d
cx portal-api-59d49d7dd6-pvp9x 200m (2%) 2200m (27%) 268Mi (1%) 1164Mi (8%) 70d
dar ppa-billing-worker-86c77d8bfb-mr9tp 350m (4%) 2500m (31%) 2176Mi (15%) 3Gi (21%) 8d
flux flux-helm-operator-5c7f4d899c-qjqkv 100m (1%) 100m (1%) 512Mi (3%) 512Mi (3%) 42d
flux memcached-2 50m (0%) 250m (3%) 256Mi (1%) 256Mi (1%) 46d
helios baseload-model-0-regressor-6485f47b9f-jzgpm 200m (2%) 2 (25%) 128Mi (0%) 1Gi (7%) 156d
helios ppa-forecast-676d4d8f6d-fkqsp 300m (3%) 2200m (27%) 384Mi (2%) 1280Mi (8%) 21d
http-headers http-headers-c765dfdc5-gplzh 120m (1%) 2020m (25%) 192Mi (1%) 1088Mi (7%) 231d
infrastructure asset-control-5c4c6c5c68-zmsb5 150m (1%) 2100m (26%) 192Mi (1%) 1088Mi (7%) 9d
infrastructure asset-control-vis-frontend-5b85f85695-9xqmf 100m (1%) 2 (25%) 128Mi (0%) 1Gi (7%) 113d
istio-system istio-citadel-6bc66499fb-5k7v2 10m (0%) 0 (0%) 0 (0%) 0 (0%) 23d
istio-system istio-galley-7889cdf457-z6vq5 10m (0%) 0 (0%) 0 (0%) 0 (0%) 23d
istio-system istio-ingressgateway-internal-c669f4dfb-v8czn 200m (2%) 4 (50%) 256Mi (1%) 2Gi (14%) 42d
istio-system istio-ingressgateway-internal-secure-f44d8878d-9pfw7 200m (2%) 4 (50%) 256Mi (1%) 2Gi (14%) 204d
kiali-operator kiali-operator-74c7ff6788-xtqr5 100m (1%) 2 (25%) 128Mi (0%) 1Gi (7%) 42d
kube-system aws-node-xwhr4 10m (0%) 0 (0%) 0 (0%) 0 (0%) 56d
kube-system coredns-89649b947-5zdfn 100m (1%) 0 (0%) 70Mi (0%) 170Mi (1%) 156d
kube-system external-secrets-fd64d899d-qcr8w 0 (0%) 0 (0%) 0 (0%) 0 (0%) 8d
kube-system kube-proxy-nwlvn 100m (1%) 0 (0%) 0 (0%) 0 (0%) 231d
kube-system kube2iam-8rgg8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 231d
kube-system tiller-deploy-59fc686959-8js7q 500m (6%) 1 (12%) 512Mi (3%) 512Mi (3%) 231d
kubernetes-dashboard dashboard-metrics-scraper-76679bc5b9-5g2p7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 42d
limejump-api-gateway jank-file-downloader-64f9fc788-qtbnc 150m (1%) 2050m (25%) 160Mi (1%) 1056Mi (7%) 42d
logging host-messages-fluentbit-bfkld 50m (0%) 50m (0%) 64Mi (0%) 64Mi (0%) 51d
logging k8s-fluentbit-nn2vk 100m (1%) 100m (1%) 128Mi (0%) 128Mi (0%) 56d
monitoring jaeger-oauth2-proxy-685d57cb9c-jjg9l 100m (1%) 2 (25%) 128Mi (0%) 1Gi (7%) 231d
monitoring jaeger-operator-7fc89cb645-p75fn 100m (1%) 2 (25%) 128Mi (0%) 1Gi (7%) 231d
monitoring prometheus-blackbox-exporter-5dfd45f65c-22kvc 100m (1%) 2 (25%) 128Mi (0%) 1Gi (7%) 42d
monitoring prometheus-node-exporter-ghws7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 231d
polaris polaris-proxy-oauth2-proxy-76869b85b7-tn9w9 0 (0%) 0 (0%) 0 (0%) 0 (0%) 42d
trading-ci trading-tradingdb-0 250m (3%) 0 (0%) 256Mi (1%) 0 (0%) 6h15m
trading operation-bot-consumer-686f8cffdd-cztzq 200m (2%) 2150m (27%) 256Mi (1%) 1152Mi (7%) 45h
trading trading-dev-7dd47c6b95-66r6z 300m (3%) 2300m (29%) 384Mi (2%) 1280Mi (8%) 38h
I also checked the inode usage, though I must admit I'm not very clued up on troubleshooting issues caused by inode allocation:
Inode info
Filesystem Inodes Used Available Use% Mounted on
overlay 33.3M 4.5M 28.7M 14% /
tmpfs 1.9M 17 1.9M 0% /dev
tmpfs 1.9M 16 1.9M 0% /sys/fs/cgroup
/dev/nvme0n1p1 33.3M 4.5M 28.7M 14% /dev/termination-log
/dev/nvme0n1p1 33.3M 4.5M 28.7M 14% /etc/resolv.conf
/dev/nvme0n1p1 33.3M 4.5M 28.7M 14% /etc/hostname
/dev/nvme0n1p1 33.3M 4.5M 28.7M 14% /etc/hosts
shm 1.9M 1 1.9M 0% /dev/shm
tmpfs 1.9M 9 1.9M 0% /run/secrets/kubernetes.io/serviceaccount
tmpfs 1.9M 1 1.9M 0% /proc/acpi
tmpfs 1.9M 17 1.9M 0% /proc/kcore
tmpfs 1.9M 17 1.9M 0% /proc/keys
tmpfs 1.9M 17 1.9M 0% /proc/latency_stats
tmpfs 1.9M 17 1.9M 0% /proc/timer_list
tmpfs 1.9M 17 1.9M 0% /proc/sched_debug
tmpfs 1.9M 1 1.9M 0% /sys/firmware
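One more thought, which I haven't confirmed here: as far as I understand it, a "no space left on device" error from a file watch can also mean the kernel's inotify limits have been hit rather than the disk or inodes being full, so the inotify limits on the node might be worth checking too (standard sysctl names):

# current inotify limits on the node
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances

# count inotify instances (watch fds) currently open across all processes
find /proc/*/fd -lname anon_inode:inotify 2>/dev/null | wc -l

# temporarily raise the limits to see if the stalls stop
sudo sysctl -w fs.inotify.max_user_watches=524288
sudo sysctl -w fs.inotify.max_user_instances=512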
Any advice would be appreciated!
Might be related to #763, depending on which version you run.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.
This issue was closed because it has been stalled for 30 days with no activity.