Prometheus metrics for backups

Question

jescarri opened this issue 6 years ago · 6 comments

jescarri commented 6 years ago

Currently there's no exporter for the etcd-backup-operator.

Creating this issue to link it to a PR.

Answer 1 · 2019-07-09T00:06:41.000Z

As a side note, I took your branch and my branch and built my own backup operator image; it works great.

Answer 2 · 2019-07-09T19:34:45.000Z

@jurgenweber yes, it's been running in our clusters for a few weeks w/o problems :)

Thanks for testing it!

Answer 3 · 2019-07-10T02:57:56.000Z

Do you have any Prometheus alerts or Grafana dashboards you wouldn't mind sharing?

Also, I am finding that if the pod gets restarted, the metrics disappear until a new backup runs. The metrics endpoint no longer returns the etcd_operator_backup.* metrics, although the others are still returned. I think they need to be returned all the time, even when they have no value yet. Thoughts?
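
One way to keep the series present after a restart is to register every backup metric with a default value at startup, before any backup has run. The sketch below is a minimal illustration in Python, not the operator's actual Go code. `BackupMetrics`, `record_backup` and `render` are hypothetical names, the metric names are the ones mentioned in this thread, and labels are omitted for brevity.

```python
import time


class BackupMetrics:
    """Hypothetical in-memory registry for the backup metrics."""

    def __init__(self):
        # Initialise every series at startup so the metrics endpoint
        # exposes them even before the first backup has run.
        self.values = {
            "etcd_operator_backups_attempt_total": 0,
            "etcd_operator_backups_success_total": 0,
            "etcd_operator_backup_last_success": 0,
        }

    def record_backup(self, succeeded, now=None):
        """Update the series after one backup attempt."""
        self.values["etcd_operator_backups_attempt_total"] += 1
        if succeeded:
            self.values["etcd_operator_backups_success_total"] += 1
            # Store the Unix timestamp of the successful backup.
            self.values["etcd_operator_backup_last_success"] = int(
                time.time() if now is None else now
            )

    def render(self):
        """Return all series in Prometheus text exposition format."""
        lines = [f"{name} {value}" for name, value in sorted(self.values.items())]
        return "\n".join(lines) + "\n"
```

Note that a zero default for the last-success timestamp would make any alert based on the age of the last backup fire right after a restart, so whether to expose 0 or to restore the previous value is a design choice.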

Answer 4 · 2019-07-10T06:39:38.000Z

@rjtsdl sure, I can do that.

I was planning to add readiness / liveness probes later, but you are right, simple handlers can do the trick.
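
For illustration, simple probe handlers could look roughly like the following. This is a sketch in Python using only the standard library, not the operator's Go code. The `/healthz` and `/readyz` paths, the `ProbeHandler` class and the `ready` flag are assumptions made for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class ProbeHandler(BaseHTTPRequestHandler):
    """Answers probe requests; /healthz and /readyz are assumed paths."""

    ready = False  # set to True once startup has finished

    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: the process is up and able to serve HTTP.
            self._reply(200, "ok")
        elif self.path == "/readyz":
            # Readiness: only report success after startup has completed.
            if ProbeHandler.ready:
                self._reply(200, "ready")
            else:
                self._reply(503, "not ready")
        else:
            self._reply(404, "not found")

    def _reply(self, status, text):
        body = text.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet


def make_server(port=0):
    """Bind the probe server; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), ProbeHandler)
```

A Kubernetes `livenessProbe` and `readinessProbe` of type `httpGet` could then point at these two paths.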

Answer 5 · 2019-07-11T04:04:48.000Z

@jurgenweber this is what we have right now:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
  labels:
    prometheus: k8s
    role: alert-rules
  name: etcd-backup
spec:
  groups:
  - name: etcd-backup
    rules:
    - alert: etcdBackupControllerDown
      annotations:
        summary: etcd-backup pod {{ $labels.kubernetes_pod_name }} has
          been down for 5 minutes
      expr: absent(up{app="etcd-backup-operator"}) == 1
      for: 5m
      labels:
        class: availability
        severity: p1
    - alert: etcdBackupsNOTAttempted
      annotations:
        summary: No etcd-backups have been attempted in the past 30 min
      expr: rate(etcd_operator_backups_attempt_total[30m]) * 1800 < 2
      for: 5m
      labels:
        class: availability
        severity: p2
    - alert: etcdBackupsNOTSucceeding
      annotations:
        summary: No etcd-backups have succeeded in the past 30 min
      expr: rate(etcd_operator_backups_success_total[30m]) * 1800 < 2
      for: 5m
      labels:
        class: availability
        severity: p2
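
The `rate(...[30m]) * 1800` expressions in these rules turn a per-second rate back into an approximate count over the 30-minute window (30 × 60 = 1800 seconds), so each rule fires when fewer than two attempts or successes were counted. The following is a simplified Python illustration of that arithmetic. It ignores the extrapolation and counter-reset handling that Prometheus applies, and `simple_rate` and the sample values are made up for the example.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample.

    Simplification: PromQL's rate() also extrapolates to the window
    boundaries and compensates for counter resets.
    """
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


WINDOW_SECONDS = 30 * 60  # the [30m] range used in the rules

# Made-up counter samples covering 30 minutes in which three backups ran.
samples = [(0, 10), (600, 11), (1200, 12), (1800, 13)]

backups_in_window = simple_rate(samples) * WINDOW_SECONDS
alert_fires = backups_in_window < 2  # same threshold as the rules
```

If the counter series is missing altogether, for example right after a pod restart, the expression returns no data and the rule cannot fire.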

Answer 6 · 2019-07-11T05:34:23.000Z

Yeah, my schedule is once an hour:

        - alert: VaultEtcdLastBackup
          annotations:
            summary: The last backup was more than 1 hour ago, please check it
            description: "vault etcd {{ $labels.instance }} backup too old"
          expr: time() - etcd_operator_backup_last_success{name="vault-etcd-cluster-backup",namespace="devops",release="amazing-dog"} > 3700
          for: 10m
          labels:
            severity: critical
        - alert: VaultEtcdBackupFailed
          annotations:
            summary: Fewer than 3 backups have succeeded in the past 3 hours. Check that backups are working.
            description: "vault etcd {{ $labels.instance }} backup has failed"
          expr: increase(etcd_operator_backups_success_total{name="vault-etcd-cluster-backup",namespace="devops",release="amazing-dog"}[3h]) < 3
          for: 10m
          labels:
            severity: critical
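
As a rough illustration of the `VaultEtcdLastBackup` condition, the Python sketch below mirrors `time() - etcd_operator_backup_last_success > 3700`. The helper name `last_backup_too_old` is hypothetical, and reading 3700 as the hourly schedule plus some slack is an assumption.

```python
MAX_AGE_SECONDS = 3700  # presumably the hourly schedule (3600 s) plus 100 s of slack


def last_backup_too_old(now, last_success):
    """Mirror `time() - etcd_operator_backup_last_success > 3700`.

    `last_success` is a Unix timestamp, or None when the series is missing,
    in which case the expression yields no result and the alert cannot fire.
    """
    if last_success is None:
        return False
    return now - last_success > MAX_AGE_SECONDS
```

The `None` branch is the reason a metric that disappears after a restart is a problem for this rule: with no series there is nothing to compare against, so the alert stays silent.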