Stale alerts for pods/nodes that don't exist (5.5.52+)
ulysseskan opened this issue
Description
We have 2 tickets open regarding high memory alerts on pods that do not exist (SE-19032 "gravity status is showing alerts for pods which do not exist", SE-19128 "When are Gravity alerts cleared?").
Though the pod alerts come from custom TICK scripts written by the customer, I've managed to reproduce this with the high memory alerts for nodes that we ship.
What happened: `gravity status` shows alerts for pods/nodes that do not exist, e.g. "Cluster alerts: * [CRITICAL] CRITICAL / ... pod ...-...-container-UUID has high memory usage: 90.01%".
In SE-19128, the customer has been seeing the alert for more than a month even though the apps are not running.
What you expected to happen: No alerts for pods/nodes that do not exist after 10+ minutes.
How to reproduce it (as minimally and precisely as possible):
- Install a 3 node gravity 5.5.52+ cluster (tested with 5.5.56)
- Install stress test from https://github.com/giantswarm/kube-stresscheck
- Run `watch sudo gravity status`. Wait 5 minutes. You will start seeing high memory usage alerts for the node running the stress check pod.
- While you still have the high memory alerts present, `sudo gravity remove` the node containing the stress-test pod. (Do not evict pods manually or remove the stress test pod; just run the gravity remove.)
- Keep watching gravity status. Note that the shrink operation completes, but you will continue to see the high memory usage alert for the removed node's IP. You will even get a NEW alert for the removed node saying the br_netfilter module is not loaded, after the node has already been removed. You can shut down the node and the alerts will remain. Also note that in this case, the removed node did not have kapacitor on it.
- Even after 1 hour, you will still see stale alerts in gravity status:
```
Every 2.0s: sudo gravity status                                node2: Wed Jan 27 00:43:27 2021

Cluster name:           awesomewescoff5340
Cluster status:         active
Application:            cluster-image, version 0.0.2
Gravity version:        5.5.56 (client) / 5.5.56 (server)
Join token:             vagrant_install
Last completed operation:
    * operation_shrink (6198670a-3e8d-495c-853d-89f8cb611c86)
      started:   Tue Jan 26 23:38 UTC (1 hour ago)
      completed: Tue Jan 26 23:39 UTC (1 hour ago)
Cluster endpoints:
    * Authentication gateway:
        - 10.0.2.30:32009
        - 10.0.2.50:32009
    * Cluster management URL:
        - https://10.0.2.30:32009
        - https://10.0.2.50:32009
Cluster nodes:
    Masters:
        * node2 (10.0.2.30, node)
            Status:        healthy
            Remote access: online
        * ubuntu-focal (10.0.2.50, node)
            Status:        healthy
            Remote access: online
Cluster alerts:
    * [WARNING] WARNING / Node 10.0.2.25 has high memory usage: 80.56850242888261%
    * [CRITICAL] br_netfilter module is not loaded on node 10.0.2.25
    * [WARNING] WARNING / Node 10.0.2.25 was rebooted
```
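For reference, the removed node is genuinely gone from the cluster while the alerts persist. A minimal way to show that, run from a remaining master (assuming kubectl access; 10.0.2.25 is the removed node from the output above):

```bash
# The removed node no longer appears in the kubernetes node list...
kubectl get nodes -o wide | grep 10.0.2.25 || echo "10.0.2.25 is not a kubernetes node"

# ...yet it still shows up under "Cluster alerts" in gravity status.
sudo gravity status | grep 10.0.2.25
```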
Can attach gravity report and kapacitor info, if needed.
Environment
- Gravity version: 5.5.56; also first seen by the customer in 5.5.52, when #2039 was added
- OS: Ubuntu 20.04
- Platform: AWS, local
Workaround
Manual workaround to remove the stale alert (side effect: removes alert history)
- Exec into the kapacitor pod and list the alert topics:
```
kubectl -n monitoring exec kapacitor-74bdc655-m4thr -c alert-loader -- kapacitor --url http://localhost:9092 list topics
```
- Show the details of the alert you want to delete:
```
kubectl -n monitoring exec kapacitor-74bdc655-m4thr -c alert-loader -- kapacitor --url http://localhost:9092 show-topic main:uptime:alert9
```
- Delete the alert:
```
kubectl -n monitoring exec kapacitor-74bdc655-m4thr -c alert-loader -- kapacitor --url http://localhost:9092 delete topics main:uptime:alert9
```
Ref: https://docs.influxdata.com/kapacitor/v1.5/working/cli_client/#delete-topics
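If you don't know the exact topic name up front, the same CLI calls can be looped. The sketch below is a hypothetical helper (not something we ship) that deletes every topic whose events still reference the removed node's IP (10.0.2.25 from the repro above); it assumes `list topics` prints a header row followed by one topic ID per line, so check that against your kapacitor version first.

```bash
#!/bin/bash
# Hypothetical sketch: delete every kapacitor alert topic that still references
# a node which is no longer part of the cluster (10.0.2.25 in this example).
# Deleting a topic also deletes its alert history, same as the manual workaround.
NODE_IP="10.0.2.25"
POD=$(kubectl get pod -l app=monitoring,component=kapacitor -n monitoring -o jsonpath="{.items[0].metadata.name}")

# Small wrapper around the kapacitor CLI inside the alert-loader container.
kap() {
  kubectl -n monitoring exec "$POD" -c alert-loader -- kapacitor --url http://localhost:9092 "$@"
}

# Skip the header line of "list topics" and walk each topic ID.
for topic in $(kap list topics | tail -n +2 | awk '{print $1}'); do
  if kap show-topic "$topic" | grep -q "$NODE_IP"; then
    echo "deleting stale topic $topic"
    kap delete topics "$topic"
  fi
done
```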
And here is a quick bash script that implements the workaround for a known topic name and can be scheduled via a Kubernetes CronJob (a sketch of the scheduling follows the script). You will need to modify the alert topic name (in the case below, it's main:high_memory:alert9).
```bash
#!/bin/bash
#
# Quick way to delete high memory alerts (also deletes alert history).
# See https://docs.influxdata.com/kapacitor/v1.5/working/cli_client/#delete-topics for usage examples.
# Use "list topics" to see current alerts, and "show-topic" to show the alert contents.
# Globs also work, e.g. delete topics "*high_memory*".
#
POD=$(kubectl get pod -l app=monitoring,component=kapacitor -n monitoring -o jsonpath="{.items[0].metadata.name}")
kubectl -n monitoring exec "$POD" -c alert-loader -- kapacitor --url http://localhost:9092 delete topics main:high_memory:alert9
```
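And a sketch of the scheduling, hedged heavily: this assumes a kubectl client new enough to have `kubectl create cronjob`, an image that ships kubectl (bitnami/kubectl is just an example), and a service account in the monitoring namespace with RBAC that allows getting pods and exec'ing into the kapacitor pod; none of that exists by default, so treat it as a starting point rather than a drop-in manifest.

```bash
# Hypothetical example of scheduling the cleanup with a CronJob created via kubectl.
# Adjust the schedule, image, and topic glob; the job also needs a service account
# that is allowed to "get pods" and "create pods/exec" in the monitoring namespace.
kubectl -n monitoring create cronjob kapacitor-stale-alert-cleanup \
  --image=bitnami/kubectl:latest \
  --schedule="*/30 * * * *" \
  -- /bin/sh -c \
  'POD=$(kubectl get pod -l app=monitoring,component=kapacitor -n monitoring -o jsonpath="{.items[0].metadata.name}") && kubectl -n monitoring exec "$POD" -c alert-loader -- kapacitor --url http://localhost:9092 delete topics "main:high_memory:*"'
```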
I can also confirm that existing stale alerts temporarily go away if you upgrade a cluster to a later version of gravity, for example from 5.5.56 to 5.5.57, even though 5.5.57 contains no fixes for this issue; it's just a side effect of upgrading. Nothing prevents new stale alerts from occurring in 5.5.57, and I got the old alert to come back after deliberately degrading the cluster for a bit.
I have not tested Gravity 6.x and 7.x, but those versions have different monitoring stacks that don't use Kapacitor, so I've been told this isn't a problem there.
This actually seems like a kapacitor problem:
https://community.influxdata.com/t/kapacitor-stateless-alerts/1647
https://community.influxdata.com/t/how-to-delete-an-alert-event/6941/7
The root problem is that when a source of data disappears, the influx monitoring stack doesn't remove alerts that have already fired for that source.
One thing suggested today that we could do for the old 5.5.x line is to roll back #2039 and move the alerts to something like `gravity status --alerthistory`, with a note mentioning that, due to kapacitor limitations, some of these alerts may be for resources that no longer exist.