STONITH lambda removing site when site view is recovered

Question

Closed this issue 2 months ago · 1 comments

With the current implementation of the lambda and the alert, the lambda is executed four times:

Once from each site with a firing event, when the vendor_jgroups_site_view_status metric reported by that site drops to 0.
Once from each site when the alert is resolved.

Here is an example payload for each case:
Firing:

{
    "receiver": "runner-keycloak/example-routing/default",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "accelerator": ".......",
                "alertname": "SiteOffline",
                "namespace": "runner-keycloak",
                "reporter": "gh-keycloak-a",
                "severity": "critical",
                "site": "gh-keycloak-b"
            },
            "annotations": {},
            "startsAt": "2024-09-10T15:37:38.634Z",
            "endsAt": "0001-01-01T00:00:00Z",
            "generatorURL": ".......",
            "fingerprint": "0a6fe94cfaf4f648"
        }
    ],
    "groupLabels": {
        "accelerator": "........."
    },
    "commonLabels": {
        "accelerator": "...........",
        "alertname": "SiteOffline",
        "namespace": "runner-keycloak",
        "reporter": "gh-keycloak-a",
        "severity": "critical",
        "site": "gh-keycloak-b"
    },
    "commonAnnotations": {},
    "externalURL": ".............",
    "version": "4",
    "groupKey": "{}/{alertname=\"SiteOffline\",namespace=\"runner-keycloak\"}:{accelerator=\".........................\"}",
    "truncatedAlerts": 0
}

Resolved:

{
    "receiver": "runner-keycloak/example-routing/default",
    "status": "resolved",
    "alerts": [
        {
            "status": "resolved",
            "labels": {
                "accelerator": "...........",
                "alertname": "SiteOffline",
                "namespace": "runner-keycloak",
                "reporter": "gh-keycloak-a",
                "severity": "critical",
                "site": "gh-keycloak-b"
            },
            "annotations": {},
            "startsAt": "2024-09-10T15:37:38.634Z",
            "endsAt": "2024-09-10T15:40:08.634Z",
            "generatorURL": "...............",
            "fingerprint": "0a6fe94cfaf4f648"
        }
    ],
    "groupLabels": {
        "accelerator": ".............."
    },
    "commonLabels": {
        "accelerator": ".................",
        "alertname": "SiteOffline",
        "namespace": "runner-keycloak",
        "reporter": "gh-keycloak-a",
        "severity": "critical",
        "site": "gh-keycloak-b"
    },
    "commonAnnotations": {},
    "externalURL": ".............",
    "version": "4",
    "groupKey": "{}/{alertname=\"SiteOffline\",namespace=\"runner-keycloak\"}:{accelerator=\"...............\"}",
    "truncatedAlerts": 0
}

This is causing problems in our test suite because the lambda does not distinguish between status == firing and status == resolved. If the endpoint group already contains both sites, the lambda also removes one of them on the resolved event. The test suite then leaves the global accelerator in a single-site state.
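One possible way to guard against this is to have the lambda look only at alerts whose status is firing and return early otherwise. The following is a minimal sketch, not the actual lambda code: the function name `sites_to_fence` is hypothetical, and it assumes the handler receives the webhook body as a JSON string shaped like the payloads above.

```python
import json


def sites_to_fence(body: str) -> list[str]:
    """Return the sites named by firing SiteOffline alerts in a webhook body.

    Hypothetical helper: resolved alerts are skipped, so a "resolved"
    notification yields an empty list and no site should be removed.
    """
    payload = json.loads(body)
    sites: list[str] = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        # Act only on alerts that are still firing; ignore resolved ones.
        if alert.get("status") != "firing":
            continue
        if labels.get("alertname") != "SiteOffline":
            continue
        site = labels.get("site")
        # Avoid listing the same site twice within one notification.
        if site and site not in sites:
            sites.append(site)
    return sites
```

With a check like this, the handler could skip any endpoint group changes whenever the returned list is empty, which would cover the resolved notifications described above.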

Answer 1 · 2024-09-11T06:39:19.000Z

@mhajas - can you please also update the docs in the main repo, as they also include the lambda?