Payback159/openfero

De-duplication of Prometheus alerts

An alert can often fire multiple times over the course of a single incident. Prometheus Alertmanager already does a lot of de-duplication and grouping, which is helpful. However, the same alert can still resolve and then trigger again while OpenFero already has a job running for it.

We should detect this scenario to reduce the number of jobs and avoid running duplicate jobs.

Alertmanager sends a groupKey with each webhook notification, which uniquely identifies the alert group. Before starting a new job for an alert, we should first check whether a job is already running for that groupKey. If one is, we should only log that the alert triggered again rather than create a new job.
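To make the starting point concrete, here is a minimal sketch of pulling the groupKey out of a webhook body. The `groupKey` and `status` field names come from the Alertmanager webhook payload; the `webhookPayload` type and `parseWebhook` function are hypothetical names for illustration, not existing OpenFero code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// webhookPayload holds only the parts of the Alertmanager webhook body
// that matter for de-duplication. The type name is hypothetical.
type webhookPayload struct {
	GroupKey string `json:"groupKey"`
	Status   string `json:"status"`
}

// parseWebhook decodes the JSON body and rejects payloads without a
// groupKey, since de-duplication would have nothing to key on.
func parseWebhook(body []byte) (webhookPayload, error) {
	var p webhookPayload
	if err := json.Unmarshal(body, &p); err != nil {
		return p, fmt.Errorf("decode webhook: %w", err)
	}
	if p.GroupKey == "" {
		return p, fmt.Errorf("webhook payload has no groupKey")
	}
	return p, nil
}

func main() {
	// Shortened example body; a real payload carries more fields.
	body := []byte(`{"version":"4","status":"firing","groupKey":"{}:{alertname=\"DiskFull\"}"}`)
	p, err := parseWebhook(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("status=%s groupKey=%s\n", p.Status, p.GroupKey)
}
```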

The main decision here, I think, will be how we link a groupKey to a specific job. If we implement this feature, we also need a way to synchronize this information between multiple OpenFero instances.

We could attach the groupKey to the generated job, e.g. as a label. When an OpenFero instance then receives an alert, it would first fetch the list of currently running jobs from the Kubernetes API and check whether a job with that groupKey is already running.
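A rough sketch of what the label approach could look like, under a few assumptions. Kubernetes label values are limited to 63 characters and a restricted character set, and a groupKey usually contains characters such as `{`, `"` and `=`, so the sketch stores a truncated SHA-256 hash of the groupKey instead of the raw value. The label name `openfero/group-key`, the `job` struct, and both function names are hypothetical; the `job` struct stands in for the fields of a Kubernetes Job that the check would need.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// groupKeyLabel is a hypothetical label name, not an existing OpenFero convention.
const groupKeyLabel = "openfero/group-key"

// labelValue turns an arbitrary groupKey into a label-safe value by
// hashing it and keeping the first 32 hex characters.
func labelValue(groupKey string) string {
	sum := sha256.Sum256([]byte(groupKey))
	return hex.EncodeToString(sum[:])[:32]
}

// job is a stand-in for the parts of a Kubernetes Job used here:
// its name, its labels, and the number of active pods.
type job struct {
	Name   string
	Labels map[string]string
	Active int32
}

// findRunningJob reports the name of an active job that carries the
// label for the given groupKey, if there is one.
func findRunningJob(jobs []job, groupKey string) (string, bool) {
	want := labelValue(groupKey)
	for _, j := range jobs {
		if j.Active > 0 && j.Labels[groupKeyLabel] == want {
			return j.Name, true
		}
	}
	return "", false
}

func main() {
	key := `{}:{alertname="DiskFull"}`
	jobs := []job{
		{Name: "cleanup-1", Labels: map[string]string{groupKeyLabel: labelValue(key)}, Active: 1},
	}
	if name, ok := findRunningJob(jobs, key); ok {
		fmt.Printf("alert triggered again, job %s is still running\n", name)
		return
	}
	fmt.Println("no running job, creating a new one")
}
```

In a real instance the job list would come from the Kubernetes API, ideally filtered server-side with a label selector on the same label, so only matching jobs are returned. One caveat: listing and then creating is not atomic, so two instances that receive the same alert at the same moment could still both create a job.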

This would save us synchronization logic between OpenFero instances, and OpenFero scaling would remain simple.