The Configure Alertmanager Operator was created for the OpenShift Dedicated platform to dynamically manage Alertmanager configurations based on the presence or absence of secrets containing a Pager Duty RoutingKey and Dead Man's Snitch URL. When the secret is created/updated/deleted, the associated Receiver and Route will be created/updated/deleted within the Alertmanager config.
The operator contains the following components:
- Secret controller: watches the
openshift-monitoring
namespace for any changes to Secrets namedalertmanager-main
,pd-secret
ordms-secret
. - Types library: these types are imported from the Alertmanager Config library and pared down to suit our config needs. (Since their library is intended for internal use only).
To avoid alert noise while a cluster is in the early stages of being installed and configured, this operator waits to configure Pager Duty -- effectively silencing alerts -- until a predetermined set of health checks, performed by osd-cluster-ready, has completed.
The Configure Alertmanager Operator exposes the following Prometheus metrics:
- pd_secret_exists: indicates that a Secret named
pd-secret
exists in theopenshift-monitoring
namespace. - dms_secret_exists: indicates that a Secret named
dms-secret
exists in theopenshift-monitoring
namespace. - am_secret_contains_pd: indicates the Pager Duty receiver is present in alertmanager.yaml.
- am_secret_contains_dms: indicates the Dead Man's Snitch receiver is present in alertmanager.yaml.
The following alerts are added to Prometheus as part of configure-alertmanager-operator:
- Mismatch between DMS secret and DMS Alertmanager config.
- Mismatch between PD secret and PD Alertmanager config.
- Alertmanager config secret does not exist.
Tips for testing on a personal cluster:
You may build (make docker-build
) and push (make docker-push
) the operator image to a personal repository by overriding components of the image URI:
IMAGE_REGISTRY
overrides the registry (default:quay.io
)IMAGE_REPOSITORY
overrides the organization (default:app-sre
)IMAGE_NAME
overrides the repository name (default:managed-cluster-validating-webhooks
)OPERATOR_IMAGE_TAG
overrides the image tag. (By default this is generated based on the current commit of your local clone of the git repository; butmake docker-build
will also always taglatest
)
For example, to build, tag, and push quay.io/my-user/configure-alertmanager-operator:latest
, you can run:
make IMAGE_REPOSITORY=my-user docker-build docker-push
Note: This step requires elevated permissions
This operator is managed by OLM, so you must switch that off, or your changes to the operator's Deployment will be overwritten:
oc scale deploy/cluster-version-operator --replicas=0 -n openshift-cluster-version
oc scale deploy/olm-operator --replicas=0 -n openshift-operator-lifecycle-manager
NOTE: Don't forget to revert these changes when you have finished testing:
oc scale deploy/olm-operator --replicas=1 -n openshift-operator-lifecycle-manager
oc scale deploy/cluster-version-operator --replicas=1 -n openshift-cluster-version
Edit the operator's deployment (oc edit deployment configure-alertmanager-operator -n openshift-monitoring
), replacing the image:
with the URI of the image you built above. The deployment will automatically delete and replace the running pod.
NOTE: If you are testing coordination with the osd-cluster-ready job, you may need to set the MAX_CLUSTER_AGE_MINUTES
environment variable in the deployment's configure-alertmanager-operator
container definition.
For example, to ensure the osd-cluster-ready Job is checked in a cluster less than 1048576 minutes (~two years) old:
containers:
- command:
- configure-alertmanager-operator
env:
- name: WATCH_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
- name: POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: OPERATOR_NAME
value: configure-alertmanager-operator
### Add this entry ###
- name: MAX_CLUSTER_AGE_MINUTES
value: "1048576"
image: quay.io/2uasimojo/configure-alertmanager-operator:latest
imagePullPolicy: Always
name: configure-alertmanager-operator
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File