samber/awesome-prometheus-alerts

Refacto: write more accurate descriptions for faster troubleshooting

samber opened this issue · 3 comments

Example:

From:

- name: Prometheus rule evaluation failures
  description: 'Prometheus encountered {{ $value }} rule evaluation failures.'
  query: 'increase(prometheus_rule_evaluation_failures_total[3m]) > 0'

To:

- name: Prometheus rule evaluation failures
  description: 'Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.'
  query: 'increase(prometheus_rule_evaluation_failures_total[3m]) > 0'

An effect field would let us improve the alert template.
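As a rough sketch of what that could look like in the rule format from the example above: the `effect` key here is hypothetical (it does not exist in the current format), and its wording is only adapted from the proposed description.

```yaml
# Hypothetical sketch: `effect` is an assumed new key, not part of the existing format.
- name: Prometheus rule evaluation failures
  # Cause: what happened.
  description: 'Prometheus encountered {{ $value }} rule evaluation failures.'
  # Effect: what it means for the operator.
  effect: 'Some alerts may be ignored.'
  query: 'increase(prometheus_rule_evaluation_failures_total[3m]) > 0'
```

Keeping cause and effect in separate keys would let a template render only the short description in one place and the description plus effect in another.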

I would welcome an effect field. I've solved this locally by including an effect, and it's very helpful: it keeps the description short where only that is required (on a status board), while Slack messages, for example, can also include specific resolutions.

e.g. [Screenshot 2020-04-30 at 14 22 15]

We can probably find a balance between:

  • Description/cause
  • Effects
  • Resolution guidelines

The GitLab infrastructure team adds a reference to a troubleshooting Markdown document.

See:

Effects

This is what the alert name should convey, as it is the first thing an operator sees when receiving an alert. Additionally, it could be supplemented by a summary annotation field.

Description/cause

In the Prometheus community this is usually done with either a message field (for example, in the kubernetes-monitoring/kubernetes-mixin project) or a description field (for example, in the node-mixin project).

Resolution guidelines

This is basically a runbook/SOP. For example, the kubernetes-mixin project includes these as a runbook_url field in the alert annotations.

These runbooks are kept in a single file, and each link points to a specific anchor in it.
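To make the three parts concrete, here is a minimal sketch of a Prometheus alerting rule that carries all of them as annotations. The group name, alert name, severity label, summary wording, and runbook URL (host, file, and anchor) are all placeholders assumed for illustration; only the expression and the description text come from the example at the top of this issue.

```yaml
groups:
  - name: prometheus            # assumed group name
    rules:
      - alert: PrometheusRuleEvaluationFailures   # assumed alert name
        expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
        labels:
          severity: critical    # assumed severity
        annotations:
          # Effect: the short text an operator reads first.
          summary: "Prometheus rule evaluation failures (instance {{ $labels.instance }})"
          # Description/cause: what happened, with the measured value.
          description: "Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts."
          # Resolution guidelines: placeholder link to an anchor in a single runbook file.
          runbook_url: "https://runbooks.example.com/runbook.md#prometheusruleevaluationfailures"
```

Pointing every alert at an anchor in one file keeps the links predictable, at the cost of maintaining one large document.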

This field is usually the most problematic one, as writing a runbook requires deep knowledge of the system itself.


Essentially, these are problems the Prometheus community has already solved.