caas-team/sparrow

Bug: Missing startup probe causes gap in data collection

niklastreml opened this issue · 5 comments

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

If a sparrow pod is restarted, it causes a gap in monitoring. This is due to the healthcheck returning an HTTP 200 to the kubernetes probe. Since Kubernetes thinks the new sparrow pod is ready to accept traffic, prometheus scrape requests are sent to that new pod immediately, before the startup routine is actually finished.

Screenshot from prometheus below

Expected Behavior

A restart of the sparrow deployment waits for the new pod to have finished its first reconciliation cycle. Only then will the pod accept traffic. This ensures, that prometheus never scrapes a pod while still in startup.

Steps To Reproduce

  1. Spin up sparrow on kubernetes
  2. Restart the deployment
  3. HTTP GET /metrics or /openapi
    You will notice sparrow returns no metrics for a few seconds after startup. It also does not show any checks in the /openapi endpoint

Relevant logs and/or screenshots, environment information, etc.

image

Who can address the issue?

No response

Anything else?

No response

I've started to implement this in feat/startup-probe. But imo we should wait with implementing this after #72 and #92 are merged because then it'll be easier to register a /healthz endpoint that checks if all configured checks are running properly with their configuration.

I've implemented the other half of this issue, regarding the /healthz endpoint I mentioned before, in feat/healthz-endpoint. Therefore I've implemented yet another sparrow component named Checker that requests each API endpoint and checks if it's healthy (returns a StatusOK).

If the problem is the gap in metrics (which is isn't a problem we should concern ourselves with but the people accessing the metrics), then the solution is the following :

As long as the check hasn't produced any metrics, don't expose a metrics endpoint or even metrics for the check.

I guess this would complicate the check registration process, if we implement something like that.

IMO this is a non-issue. The metrics exposed by the check are the last executed check results. If no check has been executed, no results can be delivered. If a check runs once every year, then the same result is delivered almost 365 days long.

The healthy and ready endpoint of the sparrow shouldn't have anything to do with the metrics endpoint. The sparrow exposes an API. If it can be served, then it's ready. If the sparrow's / is accessible, then it's healthy.

Mixing those two concerns with the check's state is mixing concerns. A controller is healthy and ready to accept requests even when the controlled resources aren't ready.

@puffitos I agree with you on this. Maybe we should just give the user the possibility to fix the gap in their metrics themself by setting startup probe values depending on their configuration. That way we wouldn't need #115 anymore and I can close it. 🙂

I don't see the need for an extra /healtz endpoint at the moment. The root API endpoint can be used for startup and readiness probes in my opinion.

Edit: I see the sparrow as ready as soon as the first config is loaded. With this we first need to get the config as soon as the sparrow has startup. At the moment we get it the first time after the interval.

Edit: I would still start with root endpoint for readiness & startup probe as soon as config reload & check process is started directly at startup