healthchecks?
Closed this issue ยท 13 comments
Hi,
Is there a way to perform health checks on the controller using kubernetes probes?
Thanks
Something a bit more built-in would be ideal, but the following is a reasonable strategy:
livenessProbe:
exec:
command:
- /bin/sh
- -c
- '! wget -q -O- localhost:3001/metrics | grep status=\"error\"'
initialDelaySeconds: 30,
periodSeconds: 30,
Checks the prometheus metrics to see if there have been any sync errors.
Thanks @timmyers.
Would we want to restart the pod if there are sync errors? Doesn't seem like it's the right healthcheck to perform.
It's definitely not ideal and I can think of cases it wouldn't be what you want. For our set-up though, generally there are 0 sync errors unless something seems wrong with the pod and all of its communication is failing. A restart seems to fix it, so this might be better than no health check.
Agreed with @timmyers -- we also have syncing stop occurring without producing any errors in the logs, and a kubectl rollout restart kubernetes-external-secrets
magically fixes it. I've tried the following and so far so good, but @timmyers's solution seems usable too, albeit both are of course less ideal than something more built-in.
livenessProbe:
exec:
command:
- /bin/sh
- -c
- "cd ~ && wget -O es.json --no-check-certificate --header \"Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)\" \"https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/apis/kubernetes-client.io/v1/namespaces/default/externalsecrets\""
periodSeconds: 30
My backup plan may be to restart it every 5 min/run it as a CronJob ๐
Thanks guys, i'll implement the probe suggested by @timmyers until a proper healthcheck is added.
I've ended up removing the livenessProbe right away as it was restarting the container due to errors that are not related to the pod's communications. The container can function as expected and throw errors due to a variety of reasons such as lack of permissions to a specific secret.
I would have readiness probe just be the same check as liveness in this case.
And that's unfortunate :( I have been pretty good with my probe so far (API call to get ExternalSecrets
in the default
namespace), but I'm still not sure if this accounts for all bad states that this controller can get into.
It's just a GET against the Kubernetes API to retrieve ExternalSecret
resources from the default
namespace. It's the same effect as kubectl get externalsecrets
(but kubectl
binary is not installed in the controller, so not a convenient option).
Again, not saying this is a comprehensive fix by any means, but in my case the controller gets into a state where it's unable to detect new ExternalSecret
resources, and it's magically fixed by a restart, so this should hopefully cover that at least.
I didn't test wget liveness probe and i don't think it actually solves the problem here because the open tcp connection from the watch request is being dropped on a different network hop and checking if we can open a new connection does not detect that particular issue. The livenessprobe should just work โข๏ธ.
I'd propose to have a /healthz
http endpoint that checks if 1) the reconciliation loop is running and 2) we're able to pull ExternalSecrets from the kube-apiserver.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.
This issue was closed because it has been stalled for 30 days with no activity.