ooni/sysadmin

LetsEncrypt: make sure we have correct hooks

bassosimone opened this issue · 4 comments

See #304 and 6ceeee7. I believe one possible way to ensure this is a daily alert that checks whether:

sudo certbot renew --dry-run

fails for a specific system or not. @darkk, @hellais does this make sense?
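A minimal sketch of such a daily check, assuming `certbot renew --dry-run` exits with a non-zero status when the simulated renewal fails. The function name `renewal_dry_run_ok` and the timeout are hypothetical and not part of the ooni/sysadmin repository; only the certbot command itself comes from this issue.

```python
import subprocess
import sys
from typing import Sequence

# The command quoted in this issue; everything else is a hypothetical sketch.
CERTBOT_DRY_RUN = ("sudo", "certbot", "renew", "--dry-run")


def renewal_dry_run_ok(cmd: Sequence[str] = CERTBOT_DRY_RUN,
                       timeout: float = 600.0) -> bool:
    """Run the command and return True only if it exits with status 0."""
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Missing binary, permission problem, or a hung run all count as failure.
        print(f"could not complete {' '.join(cmd)}: {exc}", file=sys.stderr)
        return False
    if result.returncode != 0:
        # Surface the command's own error output for whoever handles the alert.
        print(result.stderr, file=sys.stderr)
    return result.returncode == 0
```

A daily cron job or systemd timer could call `renewal_dry_run_ok()` on each host and exit non-zero on `False`, so that whatever already watches job failures raises the alert.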

darkk commented

I'm very skeptical that more alerts on capacity, upcoming failures, and rare errors are actually useful under the status quo (as I've said in https://github.com/ooni/sysadmin/issues/239#issuecomment-486853487).

We already have an alert for certificates about to expire: https://github.com/ooni/sysadmin/blob/master/ansible/roles/prometheus/files/alert_rules.yml#L136.
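For illustration, the idea behind an "about to expire" check can be sketched as below. This is not the linked Prometheus rule; the function names and the 14-day threshold are assumptions chosen for the example. It takes a certificate's `notAfter` string in the format Python's `ssl` module reports (for example via `getpeercert()`).

```python
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days between `now` (epoch seconds, default: current time) and notAfter."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / SECONDS_PER_DAY


def expires_soon(not_after: str, threshold_days: float = 14.0,
                 now: Optional[float] = None) -> bool:
    """True if the certificate expires in fewer than threshold_days days.

    The 14-day default is a hypothetical value, not the threshold used in
    alert_rules.yml.
    """
    return days_until_expiry(not_after, now) < threshold_days
```

For example, `expires_soon("Jan 15 09:34:43 2030 GMT")` compares that expiry date against the current time.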

We should check why this was not triggered in the incident in question.

I agree with @darkk that we should not add more alerts, but rather improve the ones we currently have so that they are less noisy, more useful, and more actionable.

I believe the alerts were not appearing in Slack because they carried the info severity label, which does not trigger Slack notifications.

In #336 I have fixed this.

I made some improvements to the letsencrypt role, so it should no longer cause issues like this one.

I am closing this. Please file a new issue if you think we need to do something else here.