external-secrets/kubernetes-external-secrets

status check went boom

djfinnoy opened this issue · 13 comments

Possibly the same issue as #362.

Running v8.3.0, installed via Helm chart, on GKE (1.20.8-gke.900).
The cluster gets bootstrapped by an automated pipeline, and external-secrets is expected to retrieve a single secret from Google Secret Manager.
It works about 50% of the time.

When it doesn't work, I'm seeing the following lines repeated in the pod logs:

{"level":50,"message_time":"2021-08-23T08:31:59.073Z","pid":18,"hostname":"external-secrets-kubernetes-external-secrets-b44fd8967-dgplh","payload":{},"msg":"status check went boom for oauth2-proxy/sm-oauth-github"}
{"level":30,"message_time":"2021-08-23T08:32:59.123Z","pid":18,"hostname":"external-secrets-kubernetes-external-secrets-b44fd8967-dgplh","msg":"No watch event for 60000 ms, restarting watcher for *"}
{"level":30,"message_time":"2021-08-23T08:32:59.123Z","pid":18,"hostname":"external-secrets-kubernetes-external-secrets-b44fd8967-dgplh","msg":"Stopping watch stream for namespace * due to event: END"}
{"level":30,"message_time":"2021-08-23T08:32:59.131Z","pid":18,"hostname":"external-secrets-kubernetes-external-secrets-b44fd8967-dgplh","msg":"starting poller for oauth2-proxy/sm-oauth-github"}
{"level":50,"message_time":"2021-08-23T08:32:59.131Z","pid":18,"hostname":"external-secrets-kubernetes-external-secrets-b44fd8967-dgplh","payload":{},"msg":"status check went boom for oauth2-proxy/sm-oauth-github"

Deleting the pod solves the problem, but I have not checked whether it recurs. I never saw this problem before upgrading from 6.2.0 to 8.3.0.
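For reference, an ExternalSecret against the Google Secret Manager backend looks roughly like the sketch below (field names as in the KES README; the project ID and key names are placeholders rather than the real values):

apiVersion: kubernetes-client.io/v1
kind: ExternalSecret
metadata:
  name: sm-oauth-github
  namespace: oauth2-proxy
spec:
  backendType: gcpSecretsManager
  projectId: my-gcp-project            # placeholder project ID
  data:
    - key: sm-oauth-github             # placeholder: secret name in Google Secret Manager
      name: client-secret              # placeholder: key to populate in the Kubernetes Secret
      version: latest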

I have the same problem in an OpenShift bare-metal cluster.
Does anybody have a workaround for this?

I have the same issue in Azure Kubernetes 1.20.7.

I had a similar issue in 6.4.0, and it persists in 8.3.2.

Kubernetes External Secrets version: 8.3.2
Vault version (cluster backed by Postgres): 1.7.3

This happened to me again at 18:36. The Vault cluster was just fine and I could log in; external-secrets simply stopped polling. This is affecting production systems. Please advise.


Hard to tell what goes wrong from this, and it's odd that there's no error being logged, since this particular log line (status check went boom) does try to log an error 😕

This happened again at 10:01am

{"level":30,"message_time":"2021-10-27T18:20:53.906Z","pid":17,"hostname":"kubernetes-external-secrets-69c544c96d-tkncr","msg":"No watch event for 60000 ms, restarting watcher for *"}

I'm not following; this particular log line is to be expected even in a healthy cluster or setup. The operator watches an HTTP stream, and for some reason those streams are not being cleanly closed towards the Kubernetes API, so as a precaution the operator restarts the watch every 60 seconds (or whatever is configured). Otherwise the operator could hang forever, not knowing whether nothing happened or whether no new events will ever be published on the stream. For more background, see #362.
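For anyone wanting to tune that behaviour, both intervals are set through the chart's env block. The snippet below is only a sketch; verify the exact variable names against your chart version (WATCH_TIMEOUT in particular is inferred from the "No watch event for 60000 ms" message rather than confirmed):

env:
  WATCH_TIMEOUT: 60000                  # assumed name: ms without a watch event before the watcher restarts
  POLLER_INTERVAL_MILLISECONDS: 10000   # how often each ExternalSecret is re-polled against its backend
  LOG_LEVEL: debug                      # more verbose logging may surface the error swallowed by "went boom"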

Yes, this is the last log line before KES simply stops upserting. There are NO errors output for me to provide you; the poller simply stops polling. This is happening DAILY at this point. We need a fix or I'm going to have to stop using this engine.

I also observed this issue in bare-metal Kubernetes clusters with Vault. Some secrets are no longer updated, with no error besides "status check went boom". This also happened once with a completely new ExternalSecret; it was not updated at all (no error message in the custom resource).

Kubernetes version: v1.20.11
KES version: 8.3.0
Vault version: 1.7.1+ent

I wish I had an easy fix 😄
However, I can only advise looking into ESO: https://github.com/external-secrets/external-secrets
See #864
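For anyone evaluating that migration, the single KES ExternalSecret roughly maps to an ESO SecretStore plus ExternalSecret pair. The manifests below are only a sketch (store name, project ID and keys are placeholders, provider auth is omitted, and the API version should be checked against the ESO release you install):

apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: gcp-secret-manager             # placeholder store name
  namespace: oauth2-proxy
spec:
  provider:
    gcpsm:
      projectID: my-gcp-project        # placeholder; auth (e.g. Workload Identity) not shown
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: sm-oauth-github
  namespace: oauth2-proxy
spec:
  refreshInterval: 1m
  secretStoreRef:
    name: gcp-secret-manager
    kind: SecretStore
  target:
    name: sm-oauth-github              # Kubernetes Secret that ESO keeps in sync
  data:
    - secretKey: client-secret         # placeholder key inside the Kubernetes Secret
      remoteRef:
        key: sm-oauth-github           # placeholder name of the secret in Google Secret Manager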

Can someone provide metrics for KES around the time it stops polling? I'm particularly interested in CPU/memory/netstat and nodejs_*-related metrics. It's odd that the event loop seems to stop 🤷.
Let me find some time to look into that; I think we can improve a little on the observability side.

I couldn't find an issue using the AWS provider. I let it run for 3 days with 1/10/100/500 Kind=ExternalSecret resources; CPU usage was high-ish, but it was syncing properly.
I opened a PR (see above) that exports additional nodejs_* runtime metrics, which will be helpful for debugging this issue. Please apply the patch manually or wait for a release.
Please provide some additional information, such as how many ExternalSecrets you have and your POLLING_INTERVAL, and please attach strace to the process in the container so we can see where it's stuck.
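If you run the Prometheus operator, the chart has (if I remember the values layout correctly; please verify against your chart version's values.yaml) a serviceMonitor toggle that makes those nodejs_* metrics scrapeable around the time polling stops:

serviceMonitor:        # assumed values.yaml key; check your chart version
  enabled: true
  interval: 30s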

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

This issue was closed because it has been stalled for 30 days with no activity.