CPUThrottlingHigh alert too easily triggered
lnovara opened this issue · 4 comments
The CPUThrottlingHigh alert is used to notify when the Kubelet is throttling a pod's CPU usage for more than 25% of the time over the last 5 minutes. While this might indicate wrong CPU limits, there is a particular class of pods (e.g. node-exporter) that is more subject to throttling than others.
My proposal is either to drop this alert or to increase the threshold to at least 50-75%, given the narrow time window.
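For reference, this is roughly what a raised threshold could look like as a Prometheus alerting rule. This is only a sketch: the exact expression, window, and label matchers in the shipped rule may differ, and the cAdvisor metric names below are my assumption about what the rule is based on.

```yaml
# Sketch only: a CPUThrottlingHigh rule with the threshold raised to 75%.
# The shipped rule's expression may differ; the cAdvisor metrics
# container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total
# are assumed here.
groups:
  - name: cpu-throttling
    rules:
      - alert: CPUThrottlingHigh
        expr: |
          sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (container, pod, namespace)
            /
          sum(increase(container_cpu_cfs_periods_total{container!=""}[5m])) by (container, pod, namespace)
            > (75 / 100)
        labels:
          severity: warning
        annotations:
          message: '{{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} is being CPU throttled.'
```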
What's your opinion about this?
I am also attaching some issues from upstream projects.
Refs:
It's not easy to tune this kind of alert; it's very domain-specific and depends a lot on the application's behavior. We can try setting the threshold higher (50-75%) as you suggested and analyze the behavior after that change.
It's not clear to me whether the problem is the alert itself or the current limit values in use.
If I understand correctly, you are proposing to change only the alert threshold, right? In that case, we won't know that the pods are being throttled. How much of an issue is that?
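One way to keep that visibility without paging anyone would be a recording rule that tracks the throttling ratio for a dashboard panel. A sketch, assuming the same cAdvisor metrics as above; the rule name is just a placeholder:

```yaml
# Sketch: record the per-container throttling ratio so it stays visible on a
# dashboard even if the alert itself is relaxed or dropped.
groups:
  - name: cpu-throttling.rules
    rules:
      - record: container:cpu_cfs_throttled_ratio:rate5m
        expr: |
          sum(rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (container, pod, namespace)
            /
          sum(rate(container_cpu_cfs_periods_total{container!=""}[5m])) by (container, pod, namespace)
```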
Would it make sense to instead drop the CPU limits for some pods, as OpenShift did for the monitoring ones?
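For illustration, dropping the limit for something like node-exporter would mean keeping only requests in the container spec. A rough sketch (the image tag and resource values are placeholders, and node-exporter would normally be a DaemonSet rather than a bare Pod):

```yaml
# Sketch: keep CPU requests (for scheduling) but drop the limits block, so the
# container gets no CFS quota and CPUThrottlingHigh cannot fire for it.
apiVersion: v1
kind: Pod
metadata:
  name: node-exporter
spec:
  containers:
    - name: node-exporter
      image: quay.io/prometheus/node-exporter:latest   # placeholder tag
      resources:
        requests:
          cpu: 100m
          memory: 50Mi
        # no "limits:" block -> no CPU quota, no throttling
```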
Can we temporarily disable this kind of alert? I think it is causing more problems than it solves, and the biggest risk is that you eventually get used to seeing alerts in the Slack channel and stop paying attention to them.
I am literally riddled by these alerts 🔫
Given that we are being DDoSed by these alerts 😄, I'd say let's drop them and revisit them in the future. Or, if possible, leave them muted by default so one can go to Alertmanager's dashboard and see them if needed.
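If we go the "muted by default" route, one option (a sketch, assuming a standard Alertmanager config; the channel and webhook URL are placeholders) is to route CPUThrottlingHigh to a receiver with no notification integrations, so the alert stays visible in the Alertmanager UI but never reaches Slack:

```yaml
# Sketch of an Alertmanager config fragment: CPUThrottlingHigh is routed to a
# "null" receiver, so it still shows up in the Alertmanager UI but is never
# sent to Slack.
route:
  receiver: slack-default
  routes:
    - match:
        alertname: CPUThrottlingHigh
      receiver: "null"
receivers:
  - name: slack-default
    slack_configs:
      - channel: '#alerts'                               # placeholder channel
        api_url: 'https://hooks.slack.com/services/XXX'  # placeholder webhook
  - name: "null"
```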