grafana/cortex-jsonnet

CortexIngesterRestarts

pstibrany opened this issue · 10 comments

Right now the CortexIngesterRestarts alert is noisy. It looks like this:

rate(process_start_time_seconds{job=~".+(cortex|ingester)"}[30m]) > 0

Rate over 30 minutes produces values like ~300 even for a single restart:

[Screenshot 2020-06-30: graph of the rate() expression showing values around 300 after a single restart]
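
(For context on the magnitude: process_start_time_seconds is a gauge holding the Unix timestamp of the process start, so a restart makes it jump by roughly the previous process uptime, and rate() divides that jump by the window. A rough sketch, with an illustrative uptime of ~6 days:)

# The gauge jumps by ~ the previous uptime when the process restarts.
# Illustrative numbers: ~6 days (518400 s) of uptime, 30m (1800 s) window:
#   518400 / 1800 ≈ 288, i.e. the ~300 seen above for one restart.
rate(process_start_time_seconds{job=~".+(cortex|ingester)"}[30m])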

We could use changes instead. Should we require the number of restarts (as counted by changes) to be higher than 1 before alerting?

One solution would be to only alert on a certain frequency of restarts:

changes(process_start_time_seconds{job=~".+(cortex|ingester)"}[30m]) > 5
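
(Unlike rate(), changes() counts discrete value changes within the window and ignores their magnitude, so a single restart contributes exactly 1 rather than ~300:)

# changes() counts how many times the sample value changed in the window:
# one restart -> 1, no matter how far the start timestamp jumped.
changes(process_start_time_seconds{job=~".+(cortex|ingester)"}[30m])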

Another possibility could be to ask K8s about its idea of restarts. I believe that would avoid alerting on deliberate restarts due to updates etc.:

increase(kube_pod_container_status_restarts_total{job=~".+(cortex|ingester)", pod=~"ingester-.*"}[1h]) > 1
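
(Assuming standard kube-state-metrics semantics, this avoids rollout noise because a rolling update replaces the pod rather than restarting its container in place: the replacement pod's counter starts at 0, and kube_pod_container_status_restarts_total only increments when the kubelet restarts a container within the same pod, e.g. after a crash or OOM kill.)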

I don't think we should dive down to checking the exit code. (Does the ingester ever exit normally at all?)

Does the ingester ever exit normally at all?

No, only during rollouts and crashes.

Answering my own question:

Does the ingester ever exit normally at all?

I guess it does if K8s SIGTERMs it during a rollout. But I still think checking for that exit code would be overkill (and it might even be that some binaries return a non-zero exit code when exiting because of SIGTERM).

Both of my suggestions above should still work (the first because of the frequency threshold, and the second because I believe it would only catch the irregular restarts that K8s itself has seen).

Another possibility could be to ask K8s about its idea of restarts.

We used to do this, but we were trying to decouple the alerts from Kubernetes so we could use them for Cortex on bare metal.

I wonder whether we should just revert this alert. Also, crashing ingesters aren't such a terrible thing anymore, so perhaps we should just say "more than once per hour" or something. It would need to be a long enough period to tolerate a crash after WAL replay.

We used to do this, but we were trying to decouple the alerts from Kubernetes so we could use them for Cortex on bare metal.

I see.

Then I would go for my first suggestion: Just alert if the frequency of restarts is so high that it is unlikely to be triggered by a rollout.

My take:

  • Being paged during a normal rollout is wrong, so we should clearly fix it
  • Setting a threshold > 1 would protect against rollouts and a single crash happening within a 30m window, while two crashes occurring within 30m are definitely worth a page (see the sketch below)
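
(Per the two points above, the intended expression is presumably along these lines; the exact threshold and window in the committed change may differ:)

changes(process_start_time_seconds{job=~".+(cortex|ingester)"}[30m]) > 1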

Agreed, going to update the alert now.

My advice wasn't followed closely enough. (o:

Oh I missed changes vs increase! Sorry about that. Would you mind fixing that?
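
(The mix-up matters because the two functions do very different things on this gauge. Assuming the committed fix used increase() as implied above:)

# Intended: count restarts; a single restart in the window -> 1.
changes(process_start_time_seconds{job=~".+(cortex|ingester)"}[30m]) > 1

# increase() instead treats the gauge as a counter and returns the jump in
# the start timestamp (~ previous uptime in seconds), so even one restart
# far exceeds the threshold, reintroducing the original noise problem.
increase(process_start_time_seconds{job=~".+(cortex|ingester)"}[30m]) > 1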