grafana/cortex-jsonnet

CortexIngesterRestarts

pstibrany opened this issue · 10 comments

Right now the CortexIngesterRestarts alert is noisy. It looks like this:

rate(process_start_time_seconds{job=~".+(cortex|ingester)"}[30m]) > 0

Rate over 30 minutes produces values like ~300 even for a single restart:

[Screenshot 2020-06-30: graph of the rate() expression showing values around 300 after a single restart]
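
(For context on the magnitude: process_start_time_seconds is a gauge holding the Unix timestamp of the process start, so a restart makes it jump by roughly the previous process uptime, and rate() divides that jump by the window. A rough sketch, with an illustrative uptime of ~6 days:)

# The gauge jumps by ~ the previous uptime when the process restarts.
# Illustrative numbers: ~6 days (518400 s) of uptime, 30m (1800 s) window:
#   518400 / 1800 ≈ 288, i.e. the ~300 seen above for one restart.
rate(process_start_time_seconds{job=~".+(cortex|ingester)"}[30m])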

We could use changes instead. Should we require the number of restarts (as counted by changes) to be higher than 1 before alerting?

One solution would be to only alert on a certain frequency of restarts:

changes(process_start_time_seconds{job=~".+(cortex|ingester)"}[30m]) > 5
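
(Unlike rate(), changes() counts discrete value changes within the window and ignores their magnitude, so a single restart contributes exactly 1 rather than ~300:)

# changes() counts how many times the sample value changed in the window:
# one restart -> 1, no matter how far the start timestamp jumped.
changes(process_start_time_seconds{job=~".+(cortex|ingester)"}[30m])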

Another possibility could be to ask K8s about its idea of restarts. I believe that would avoid alerting on deliberate restarts due to updates etc.:

increase(kube_pod_container_status_restarts_total{job=~".+(cortex|ingester)", pod=~"ingester-.*"}[1h]) > 1
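
(Assuming standard kube-state-metrics semantics, this avoids rollout noise because a rolling update replaces the pod rather than restarting its container in place: the replacement pod's counter starts at 0, and kube_pod_container_status_restarts_total only increments when the kubelet restarts a container within the same pod, e.g. after a crash or OOM kill.)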

I don't think we should dive down to checking the exit code. (Does the ingester ever exit normally at all?)

Does the ingester ever exit normally at all?

No, only during rollouts and crashes.

Answering my own question:

Does the ingester ever exit normally at all?

I guess it does if K8s SIGTERMs it during a rollout. But I still think checking for that exit code would be overkill (and it might even be that some binaries return a non-zero exit code when exiting because of SIGTERM).

Both of my suggestions above should still work (the first because of the frequency threshold, and the second because I believe it would only catch the irregular restarts that K8s itself has seen).

Another possibility could be to ask K8s about its idea of restarts.

We used to do this, but we were trying to decouple the alerts from Kubernetes so we could use them for Cortex on bare metal.

I wonder whether we should just revert this alert. Also, crashing ingesters aren't such a terrible thing anymore, so perhaps we should just say "more than once per hour" or something. It would need to be a long enough period to tolerate a crash after WAL replay.

We used to do this, but we were trying to decouple the alerts from Kubernetes so we could use them for Cortex on bare metal.

I see.

Then I would go for my first suggestion: Just alert if the frequency of restarts is so high that it is unlikely to be triggered by a rollout.

My take:

  • Being paged during a normal rollout is wrong, so we should clearly fix it
  • Setting a threshold > 1 would protect against rollouts and a single crash happening within a 30m window, while two crashes occurring within 30m are definitely worth a page (see the sketch below)
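
(Per the two points above, the intended expression is presumably along these lines; the exact threshold and window in the committed change may differ:)

changes(process_start_time_seconds{job=~".+(cortex|ingester)"}[30m]) > 1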

Agreed, going to update the alert now.

My advice wasn't followed closely enough. (o:

Oh I missed changes vs increase! Sorry about that. Would you mind fixing that?
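
(The mix-up matters because the two functions do very different things on this gauge. Assuming the committed fix used increase() as implied above:)

# Intended: count restarts; a single restart in the window -> 1.
changes(process_start_time_seconds{job=~".+(cortex|ingester)"}[30m]) > 1

# increase() instead treats the gauge as a counter and returns the jump in
# the start timestamp (~ previous uptime in seconds), so even one restart
# far exceeds the threshold, reintroducing the original noise problem.
increase(process_start_time_seconds{job=~".+(cortex|ingester)"}[30m]) > 1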