Refactor: write more accurate descriptions for faster troubleshooting
samber opened this issue · 3 comments
Example:
From:
- name: Prometheus rule evaluation failures
  description: 'Prometheus encountered {{ $value }} rule evaluation failures.'
  query: 'increase(prometheus_rule_evaluation_failures_total[3m]) > 0'
To:
- name: Prometheus rule evaluation failures
  description: 'Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.'
  query: 'increase(prometheus_rule_evaluation_failures_total[3m]) > 0'
A dedicated `effect` field would let us improve the alert template.
We can probably find a balance between:
- Description/cause
- Effects
- Resolution guidelines
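As a rough sketch of what that balance could look like in this repository's rule format: the `effect` and `resolution` keys below are hypothetical (they do not exist in the current format), and the resolution text is only a placeholder, not vetted guidance.

```yaml
# Sketch only: `effect` and `resolution` are proposed, hypothetical fields.
- name: Prometheus rule evaluation failures
  # Description/cause: what happened
  description: 'Prometheus encountered {{ $value }} rule evaluation failures.'
  # Effects: why the operator should care
  effect: 'Alerts defined by the failing rules are potentially ignored.'
  # Resolution guidelines: placeholder text, to be replaced by real guidance or a link
  resolution: 'Check the Prometheus logs to find the failing rule and review its expression.'
  query: 'increase(prometheus_rule_evaluation_failures_total[3m]) > 0'
```

Keeping the three parts in separate keys would let the template render them independently, instead of packing everything into one description string.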
The GitLab infrastructure team adds a reference to a troubleshooting Markdown document.
See:
Effects
This is what the alert name should convey, since it is the first thing an operator sees when receiving an alert. It could also be complemented by the `summary` annotation field.
Description/cause
In the Prometheus community, this is usually done with either a `message` field (for example, in the kubernetes-monitoring/kubernetes-mixin project) or a `description` field (for example, in the node-mixin project).
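To illustrate, here is a minimal sketch of the example rule from the issue written as a native Prometheus alerting rule, with the effect carried by the alert name and `summary`, and the cause by `description`. The group name, alert name, `for` duration, and `severity` label are assumptions made for the example, not taken from any of the projects mentioned.

```yaml
groups:
  - name: prometheus-self-monitoring        # hypothetical group name
    rules:
      - alert: PrometheusRuleEvaluationFailures
        expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
        for: 0m                              # assumed: fire immediately
        labels:
          severity: critical                 # assumed severity
        annotations:
          # Effect: short, operator-facing line
          summary: 'Prometheus rule evaluation failures (instance {{ $labels.instance }})'
          # Description/cause: what was observed
          description: 'Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.'
```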
Resolution guidelines
This is basically a runbook/SOP. For example, the kubernetes-mixin project includes these as a `runbook_url` field in the alert annotations.
These runbooks live in a single file, and each alert links to a specific anchor in it.
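A sketch of how such a link might look on a rule, assuming a single runbook file with one anchor per alert. The URL and the anchor naming scheme are placeholders, not the actual kubernetes-mixin values.

```yaml
groups:
  - name: prometheus-self-monitoring        # hypothetical group name
    rules:
      - alert: PrometheusRuleEvaluationFailures
        expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
        annotations:
          description: 'Prometheus encountered {{ $value }} rule evaluation failures.'
          # Placeholder URL: one runbook file, one anchor per alert
          runbook_url: 'https://example.com/runbook.md#prometheusruleevaluationfailures'
```

Deriving the anchor from the alert name keeps the links predictable, so a missing runbook section is easy to spot.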
This field is usually the hardest one to fill in, since writing a runbook requires deep knowledge of the system itself.
Essentially, these are problems the Prometheus community has already solved.