Prometheus metrics for backups
jescarri opened this issue · 6 comments
Currently there's no exporter for the etcd-backup-operator.
Creating this issue to link it to a PR.
As a side note, I took your branch and my branch and built my own backup operator image; it works great.
@jurgenweber yes, it's been running in our clusters for a few weeks w/o problems :)
Thanks for testing it!
Do you have any Prometheus alerts or Grafana dashboards you wouldn't mind sharing?
Also, I am finding that if the pod gets restarted, the metrics disappear until a new backup is run. The metrics endpoint no longer has the etcd_operator_backup.* metrics, but the others are still returned. I think they need to be returned all the time, even if they have no value. Thoughts?
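One way to keep a series visible across restarts is to register every metric with a zero value at startup, before any backup has run. The following is a minimal, stdlib-only sketch of that idea, not the operator's actual code: the `gauges` type and its methods are hypothetical, labels such as `name` and `namespace` are ignored, and a real exporter would more likely use a Prometheus client library. The metric names are the ones mentioned in this thread.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// gauges is a hypothetical in-memory store for the values served on /metrics.
type gauges struct {
	mu     sync.Mutex
	values map[string]float64
}

// newGauges pre-registers every name with 0, so each series is present
// on the first scrape after a restart instead of appearing only after
// the first backup.
func newGauges(names ...string) *gauges {
	g := &gauges{values: make(map[string]float64)}
	for _, n := range names {
		g.values[n] = 0
	}
	return g
}

// set records a new value for a metric.
func (g *gauges) set(name string, v float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.values[name] = v
}

// render emits one "name value" line per metric, sorted by name,
// in the style of the Prometheus text exposition format.
func (g *gauges) render() string {
	g.mu.Lock()
	defer g.mu.Unlock()
	names := make([]string, 0, len(g.values))
	for n := range g.values {
		names = append(names, n)
	}
	sort.Strings(names)
	var b strings.Builder
	for _, n := range names {
		fmt.Fprintf(&b, "%s %g\n", n, g.values[n])
	}
	return b.String()
}

func main() {
	g := newGauges(
		"etcd_operator_backup_last_success",
		"etcd_operator_backups_attempt_total",
		"etcd_operator_backups_success_total",
	)
	// All three series are printed even though no backup has run yet.
	fmt.Print(g.render())
}
```

One caveat with this approach: a last-success timestamp that starts at 0 makes any "time since last success" alert fire right after a restart, so such a rule may need a guard for the zero value.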
@rjtsdl sure, I can do that.
I was planning to add readiness / liveness probes later, but you are right, simple handlers can do the trick.
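For reference, simple probe handlers could look roughly like the sketch below. This is an illustration only, assuming Go 1.19 or later for `atomic.Bool`; the `/healthz` and `/readyz` paths, the `ready` flag, and the helper names are assumptions rather than anything taken from the operator.

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// ready is a hypothetical flag the operator would set once startup is done.
var ready atomic.Bool

// probeStatus maps the readiness state to an HTTP status code and body.
func probeStatus(isReady bool) (int, string) {
	if isReady {
		return http.StatusOK, "ok"
	}
	return http.StatusServiceUnavailable, "not ready"
}

// newProbeMux wires a liveness handler and a readiness handler.
func newProbeMux() *http.ServeMux {
	mux := http.NewServeMux()
	// Liveness: the process is up and able to serve HTTP.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
		fmt.Fprint(w, "ok")
	})
	// Readiness: report 503 until the ready flag has been set.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		code, body := probeStatus(ready.Load())
		w.WriteHeader(code)
		fmt.Fprint(w, body)
	})
	return mux
}

func main() {
	ready.Store(true)
	code, body := probeStatus(ready.Load())
	fmt.Println(code, body)
	// A real binary would serve the mux, for example:
	// http.ListenAndServe(":8080", newProbeMux())
	_ = newProbeMux()
}
```

The Kubernetes `livenessProbe` and `readinessProbe` would then point `httpGet` at these two paths.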
@jurgenweber this is what we have right now:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
  labels:
    prometheus: k8s
    role: alert-rules
  name: etcd-backup
spec:
  groups:
  - name: etcd-backup
    rules:
    - alert: etcdBackupControllerDown
      annotations:
        summary: etcd-backup pod {{ $labels.kubernetes_pod_name }} has been down for 5 minutes
      expr: absent(up{app="etcd-backup-operator"}) == 1
      for: 5m
      labels:
        class: availability
        severity: p1
    - alert: etcdBackupsNOTAttempted
      annotations:
        summary: No etcd-backups have been attempted in the past 30 min
      expr: rate(etcd_operator_backups_attempt_total[30m]) * 1800 < 2
      for: 5m
      labels:
        class: availability
        severity: p2
    - alert: etcdBackupsNOTSucceeding
      annotations:
        summary: No etcd-backups have succeeded in the past 30 min
      expr: rate(etcd_operator_backups_success_total[30m]) * 1800 < 2
      for: 5m
      labels:
        class: availability
        severity: p2
Yeah, my schedule is once an hour:
- alert: VaultEtcdLastBackup
  annotations:
    summary: The last backup was more than 1 hour ago, please check it
    description: "vault etcd {{ $labels.instance }} backup too old"
  expr: time() - etcd_operator_backup_last_success{name="vault-etcd-cluster-backup",namespace="devops",release="amazing-dog"} > 3700
  for: 10m
  labels:
    severity: critical
- alert: VaultEtcdBackupFailed
  annotations:
    summary: The backup has failed, we check for the last 3 successful backup attempts. Check that it is working.
    description: "vault etcd {{ $labels.instance }} backup has failed"
  expr: increase(etcd_operator_backups_success_total{name="vault-etcd-cluster-backup",namespace="devops",release="amazing-dog"}[3h]) == 3
  for: 10m
  labels:
    severity: critical
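The threshold arithmetic in the rules shared in this thread can be sketched as two plain functions. This is only an illustration with hypothetical function names: it mirrors the comparisons `time() - last_success > 3700` and `rate(...[30m]) * 1800 < 2`, and it ignores the extrapolation and counter-reset handling that Prometheus applies when computing `rate()`.

```go
package main

import "fmt"

// backupTooOld mirrors `time() - last_success > maxAge`.
// With an hourly schedule, 3700 is one hour (3600 s) plus 100 s of slack.
func backupTooOld(nowUnix, lastSuccessUnix, maxAgeSeconds float64) bool {
	return nowUnix-lastSuccessUnix > maxAgeSeconds
}

// tooFewEvents mirrors `rate(counter[window]) * window < min`.
// A per-second rate multiplied by the window length in seconds
// approximates how many times the counter increased in that window.
func tooFewEvents(perSecondRate, windowSeconds, min float64) bool {
	return perSecondRate*windowSeconds < min
}

func main() {
	// Last success 4000 s ago: older than the 3700 s threshold.
	fmt.Println(backupTooOld(10000, 6000, 3700))
	// Roughly one backup per 30 minutes: fewer than 2 in the window.
	fmt.Println(tooFewEvents(1.0/1800, 1800, 2))
}
```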