prometheus-alertmanager-cloudwatch-webhook

A solution to monitor the Prometheus alerting pipeline using CloudWatch alarms


The project provides an application that exposes a webhook for Prometheus Alertmanager. The webhook is invoked by the DeadMansSwitch alert from Prometheus, and on every invocation it updates a metric in CloudWatch. You can use that metric to set up a CloudWatch alarm and get notified when the Prometheus alerting pipeline is not healthy.

The project can also feed multiple CloudWatch alarms: it uses the alert name from the Alertmanager webhook as the metric name.
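To make the mechanism concrete, here is a minimal sketch in Go of what such a webhook handler can look like: it decodes the Alertmanager webhook payload, takes each alert's alertname label and publishes it as a CloudWatch metric via PutMetricData. This is an illustration only, not the project's actual source; the Prometheus namespace, the :8080 listen address and the type names are assumptions.

package main

import (
	"encoding/json"
	"log"
	"net/http"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

// webhookPayload is the minimal subset of the Alertmanager webhook format needed here.
type webhookPayload struct {
	Alerts []struct {
		Labels map[string]string `json:"labels"`
	} `json:"alerts"`
}

func main() {
	// Credentials and region come from the environment / instance role (e.g. via kube2iam).
	sess := session.Must(session.NewSession())
	cw := cloudwatch.New(sess)

	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		var p webhookPayload
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		for _, alert := range p.Alerts {
			// The alert name becomes the CloudWatch metric name.
			_, err := cw.PutMetricData(&cloudwatch.PutMetricDataInput{
				Namespace: aws.String("Prometheus"), // assumed namespace, matching the alarm example below
				MetricData: []*cloudwatch.MetricDatum{{
					MetricName: aws.String(alert.Labels["alertname"]),
					Value:      aws.Float64(1),
				}},
			})
			if err != nil {
				log.Printf("PutMetricData failed: %v", err)
			}
		}
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", nil)) // listen address is an example
}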

You can find more information in this blog post.

Requirements

  1. IAM Role (to allow the app to put CloudWatch metrics)
  2. kube2iam (the default deployment uses a kube2iam annotation)
  3. kustomize tool (or you can apply manifests manually)

Installation

Create required IAM role

Create a new IAM role named k8s-alertmanager-cloudwatch-webhook with the following trust relationship:

{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "AWS": [
          "arn:aws:iam::123456789012:role/nodes.cluster.name"
        ]
      },
      "Effect": "Allow"
    }
  ]
}

Replace the principal ARN with the role that your Kubernetes nodes use. Then attach the following policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "cloudwatch:PutMetricData"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}

Deploy the application to Kubernetes

git clone git@github.com:tomaszkiewicz/prometheus-alertmanager-cloudwatch-webhook.git
cd prometheus-alertmanager-cloudwatch-webhook/build/k8s
kustomize build | kubectl apply -f -

If you do not use kustomize, you can apply the deployment.yaml and service.yaml files manually. Note that the default configuration deploys to the monitoring namespace, as is also the case for the Prometheus Operator.

Configure Alertmanager

Modify your Alertmanager configuration to include a new receiver definition and a new route. Here's a sample configuration:

route:
  receiver: "slack"
  group_by:
  - severity
  - alertname
  routes:
  - receiver: "cloudwatch"
    match:
      alertname: DeadMansSwitch
    group_wait: 30s
    group_interval: 1m
    repeat_interval: 1m
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 48h

receivers:
- name: "cloudwatch"
  webhook_configs:
  - url: "http://alertmanager-cloudwatch-webhook/webhook"
- name: "slack"
  ...

Configure CloudWatch alarm

Example alarm definition in Terraform:

resource "aws_cloudwatch_metric_alarm" "p8s_dead_mans_switch" {
  alarm_name = "prometheus-alertmanager-pipeline-health"
  alarm_description = "This metric shows health of alerting pipeline"
  comparison_operator = "LessThanThreshold"
  evaluation_periods = "5"
  metric_name = "DeadMansSwitch"
  namespace = "Prometheus"
  period = "60"
  statistic = "Minimum"
  threshold = "1"
  treat_missing_data = "breaching"
  alarm_actions = ["${module.slack_alarm_notification.sns_topic_arn}"]
  ok_actions = ["${module.slack_alarm_notification.sns_topic_arn}"]
}

Modify alarm_actions and ok_actions to point at your own notification target (for example, an SNS topic). Because treat_missing_data is set to breaching, the alarm fires as soon as the DeadMansSwitch metric stops arriving, which is exactly what happens when the alerting pipeline breaks.
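To verify that the metric the alarm watches is actually being updated, you can query it directly. Below is a small sketch using the AWS SDK for Go, assuming the Prometheus namespace and DeadMansSwitch metric name from the alarm above; the region is an example.

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

func main() {
	// Region is an example; use the region the webhook publishes to.
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("eu-west-1")}))
	cw := cloudwatch.New(sess)

	// Fetch the last 15 minutes of the DeadMansSwitch metric in 60-second buckets.
	out, err := cw.GetMetricStatistics(&cloudwatch.GetMetricStatisticsInput{
		Namespace:  aws.String("Prometheus"),
		MetricName: aws.String("DeadMansSwitch"),
		StartTime:  aws.Time(time.Now().Add(-15 * time.Minute)),
		EndTime:    aws.Time(time.Now()),
		Period:     aws.Int64(60),
		Statistics: []*string{aws.String("Minimum")},
	})
	if err != nil {
		log.Fatal(err)
	}

	// If the pipeline is healthy, recent datapoints should show up here.
	for _, dp := range out.Datapoints {
		fmt.Printf("%s minimum=%v\n", dp.Timestamp.Format(time.RFC3339), *dp.Minimum)
	}
}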