STONITH lambda removing site when site view is recovered
mhajas commented
With the current implementation of the lambda and the alert, the lambda is executed 4 times:
- Once per site with a firing event, when the vendor_jgroups_site_view_status metric decreases to 0.
- Once per site when the alert is resolved.
Here is an example payload for each of the two event types:
Firing:
{
"receiver": "runner-keycloak/example-routing/default",
"status": "firing",
"alerts": [
{
"status": "firing",
"labels": {
"accelerator": ".......",
"alertname": "SiteOffline",
"namespace": "runner-keycloak",
"reporter": "gh-keycloak-a",
"severity": "critical",
"site": "gh-keycloak-b"
},
"annotations": {},
"startsAt": "2024-09-10T15:37:38.634Z",
"endsAt": "0001-01-01T00:00:00Z",
"generatorURL": ".......",
"fingerprint": "0a6fe94cfaf4f648"
}
],
"groupLabels": {
"accelerator": "........."
},
"commonLabels": {
"accelerator": "...........",
"alertname": "SiteOffline",
"namespace": "runner-keycloak",
"reporter": "gh-keycloak-a",
"severity": "critical",
"site": "gh-keycloak-b"
},
"commonAnnotations": {},
"externalURL": ".............",
"version": "4",
"groupKey": "{}/{alertname=\"SiteOffline\",namespace=\"runner-keycloak\"}:{accelerator=\".........................\"}",
"truncatedAlerts": 0
}
Resolved:
{
"receiver": "runner-keycloak/example-routing/default",
"status": "resolved",
"alerts": [
{
"status": "resolved",
"labels": {
"accelerator": "...........",
"alertname": "SiteOffline",
"namespace": "runner-keycloak",
"reporter": "gh-keycloak-a",
"severity": "critical",
"site": "gh-keycloak-b"
},
"annotations": {},
"startsAt": "2024-09-10T15:37:38.634Z",
"endsAt": "2024-09-10T15:40:08.634Z",
"generatorURL": "...............",
"fingerprint": "0a6fe94cfaf4f648"
}
],
"groupLabels": {
"accelerator": ".............."
},
"commonLabels": {
"accelerator": ".................",
"alertname": "SiteOffline",
"namespace": "runner-keycloak",
"reporter": "gh-keycloak-a",
"severity": "critical",
"site": "gh-keycloak-b"
},
"commonAnnotations": {},
"externalURL": ".............",
"version": "4",
"groupKey": "{}/{alertname=\"SiteOffline\",namespace=\"runner-keycloak\"}:{accelerator=\"...............\"}",
"truncatedAlerts": 0
}
This is causing problems in our test suite, because the lambda does not distinguish between status == firing and status == resolved. If the endpoint group already contains both sites, the lambda also removes one of them on the resolved event. The test suite then leaves the global accelerator in a single-site state.
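One possible way to address this is to have the lambda ignore any payload or alert whose status is not firing. The following is only a minimal sketch, not the actual lambda code: the function names `firing_site_offline_alerts` and `handler` are hypothetical, the assumption that the Alertmanager JSON arrives as a string in `event["body"]` may not match the real invocation, and the actual removal of a site from the endpoint group is left out.

```python
import json


def firing_site_offline_alerts(payload):
    """Return (reporter, site) pairs for SiteOffline alerts that are still firing.

    Resolved payloads and resolved alerts yield nothing, so the caller
    never fences a site in response to a recovery notification.
    """
    if payload.get("status") != "firing":
        return []
    pairs = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue
        labels = alert.get("labels", {})
        if labels.get("alertname") != "SiteOffline":
            continue
        pair = (labels.get("reporter"), labels.get("site"))
        if pair not in pairs:
            pairs.append(pair)
    return pairs


def handler(event, context=None):
    # Assumption: the webhook JSON is passed as a string in event["body"].
    payload = json.loads(event["body"])
    pairs = firing_site_offline_alerts(payload)
    if not pairs:
        # Nothing is firing (for example a resolved event): do not touch
        # the endpoint group.
        return {"statusCode": 204, "body": ""}
    # Hypothetical: this is where the lambda would decide which site to
    # remove from the endpoint group, based on the reporter/site labels.
    body = json.dumps({"firing": [list(p) for p in pairs]})
    return {"statusCode": 200, "body": body}
```

With this check in place, the resolved payload shown above would result in no change to the endpoint group, while the firing payload would still be processed.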