FairwindsOps/gemini

Monitor and track the status of snapshots

liorfranko opened this issue · 4 comments

Hi,

Can you expose metrics that show the status of snapshots?
We want to create a dashboard and alerts to make sure snapshots don't fail.

Thanks,

rbren commented

Thanks for the request! What kind of metrics would you want that aren't available via kubectl? You can already see the snapshot status that way.

Using kubectl is nice, but I want to set up alerts rather than check the snapshots manually.
Examples of metrics:
Number of snapshots
Status of each snapshot
If they're ready or not
Age of snapshot
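
To make the list above concrete, here is a minimal sketch of how those values could be rendered in the Prometheus text exposition format. It is not Gemini code: the metric names (`gemini_snapshot_count`, `gemini_snapshot_ready`, `gemini_snapshot_age_seconds`) and the shape of the snapshot records (`name`, `ready`, `created`) are assumptions made up for illustration.

```python
from datetime import datetime, timezone


def render_snapshot_metrics(snapshots, now):
    """Render hypothetical snapshot records as Prometheus text-format lines.

    Each record is assumed to be a dict with "name" (str), "ready" (bool)
    and "created" (timezone-aware datetime). These field names are
    illustrative, not taken from Gemini.
    """
    lines = [
        "# TYPE gemini_snapshot_count gauge",
        f"gemini_snapshot_count {len(snapshots)}",
        "# TYPE gemini_snapshot_ready gauge",
    ]
    # One sample per snapshot: 1 if ready, 0 otherwise.
    for s in snapshots:
        lines.append(f'gemini_snapshot_ready{{name="{s["name"]}"}} {int(s["ready"])}')
    lines.append("# TYPE gemini_snapshot_age_seconds gauge")
    # Age is measured from the creation timestamp to "now".
    for s in snapshots:
        age = int((now - s["created"]).total_seconds())
        lines.append(f'gemini_snapshot_age_seconds{{name="{s["name"]}"}} {age}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    example = [{"name": "demo", "ready": True, "created": now}]
    print(render_snapshot_metrics(example, now), end="")
```

With gauges like these, an alert could fire when a ready gauge stays at 0 or when the newest snapshot's age exceeds the schedule interval.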

This reminds me of https://kubernetes.io/blog/2021/04/16/volume-health-monitoring-alpha-update/.

However, I think the most sensible thing to add to the controller is a set of Prometheus metrics for things like a snapshot failing to create, the number of active create/restore processes, and the total number of PVCs and snapshots managed by the controller.
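
As a rough sketch of what controller-level metrics like these could look like, the snippet below tracks a failure counter, an in-flight operation gauge, and managed-object gauges, then renders them in the Prometheus text format. Everything here is hypothetical: the class, the `track_create` helper, and the metric names are assumptions for illustration, and a real controller would more likely use a Prometheus client library than hand-rolled rendering.

```python
from contextlib import contextmanager


class ControllerMetrics:
    """Hypothetical in-process metrics for a snapshot controller."""

    def __init__(self):
        self.create_failures = 0    # counter: only ever increases
        self.active_operations = 0  # gauge: create/restore operations in flight
        self.managed_pvcs = 0       # gauge: set by the (assumed) reconcile loop
        self.managed_snapshots = 0  # gauge: set by the (assumed) reconcile loop

    @contextmanager
    def track_create(self):
        """Wrap a snapshot creation so in-flight and failure counts stay accurate."""
        self.active_operations += 1
        try:
            yield
        except Exception:
            self.create_failures += 1
            raise
        finally:
            self.active_operations -= 1

    def render(self):
        """Return all metrics in the Prometheus text exposition format."""
        samples = [
            ("gemini_snapshot_create_failures_total", "counter", self.create_failures),
            ("gemini_active_operations", "gauge", self.active_operations),
            ("gemini_managed_pvcs", "gauge", self.managed_pvcs),
            ("gemini_managed_snapshots", "gauge", self.managed_snapshots),
        ]
        lines = []
        for name, kind, value in samples:
            lines.append(f"# TYPE {name} {kind}")
            lines.append(f"{name} {value}")
        return "\n".join(lines) + "\n"
```

Wrapping each creation in `with metrics.track_create():` keeps the gauge correct even when the operation raises, and the failure counter is what an alert on failed snapshots would watch.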

Can we re-open this? Having metrics that show the Gemini controller is working and that our Gemini resources are valid (i.e., point to real PVCs) is critical. I can't see any status output on the SnapshotGroup resource that would indicate whether the controller is working and the configuration is valid.