LetsEncrypt: make sure we have correct hooks
bassosimone opened this issue · 4 comments
I'm very skeptical that more alerts on capacity, upcoming failures, and rare errors are actually useful within the status quo (as I've said in https://github.com/ooni/sysadmin/issues/239#issuecomment-486853487).
We already have an alert for certificates about to expire: https://github.com/ooni/sysadmin/blob/master/ansible/roles/prometheus/files/alert_rules.yml#L136.
We should check why this was not triggered in the incident in question.
I agree with @darkk that we should not be having more alerts, but rather try to improve the ones we currently have to make them less noisy and more useful and actionable.
I believe the alerts were not appearing in Slack because they carried the `info` severity label, and alerts at that severity were not being sent to Slack.
In #336 I have fixed this.
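For readers unfamiliar with how a severity label can keep an alert out of Slack, here is a minimal Alertmanager routing sketch. It is not the actual change made in #336: the receiver names, the channel, and the webhook URL are hypothetical placeholders, and it assumes an Alertmanager version that still accepts `match_re` in routes.

```yaml
# Hypothetical sketch, not taken from the ooni/sysadmin repository.
route:
  # Default receiver: alerts that match no child route end up here
  # and, in this sketch, are silently dropped.
  receiver: 'null'
  routes:
    # If `info` were missing from this pattern, info-level alerts
    # (such as a certificate-expiry warning) would never reach Slack.
    - match_re:
        severity: info|warning|critical
      receiver: team-slack

receivers:
  - name: 'null'
  - name: team-slack
    slack_configs:
      - channel: '#alerts'                                # placeholder channel
        api_url: https://hooks.slack.com/services/XXX     # placeholder webhook URL
        send_resolved: true
```

The alternative to widening the route is to raise the severity of the alert itself in the Prometheus rule file, so that it matches a route that already exists.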
I also made some improvements to the letsencrypt role; these should not cause any problems here.
I am closing this. Please file a new issue if you think there is something else we need to do here.