ooni/sysadmin

registry.proteus.test.ooni.io was expired for some hours

bassosimone opened this issue

Detection

@bassosimone noticed this during measurement-kit/mkorchestra#7.

Impact

@bassosimone could not merge measurement-kit/mkorchestra#7 because the integration tests were failing. Since this library is not yet stable and is not yet used by MK, the impact of this incident was minimal.

Timeline

2019-03-05T12:51:06Z: first renew failure per /var/log/letsencrypt/letsencrypt.log.*
2019-04-04T11:16:30Z: Not After for the expiring certificate
2019-04-11T00:00:00Z: the incident was already ongoing according to our monitoring
2019-04-26T11:23:00Z: @bassosimone notices that measurement-kit/mkorchestra#7 is failing because of an expired certificate.
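
Taking the timestamps above at face value, the two gaps that matter can be worked out with a short Python sketch: how long renewals had been failing before the certificate expired, and how long the certificate had been expired when the failure was noticed. The `parse_utc` helper is a name introduced here for illustration only; it is not part of this repository.

```python
from datetime import datetime, timezone


def parse_utc(stamp):
    """Parse a timeline timestamp such as 2019-04-04T11:16:30Z (hypothetical helper)."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# Timestamps copied from the timeline above.
first_renew_failure = parse_utc("2019-03-05T12:51:06Z")
not_after = parse_utc("2019-04-04T11:16:30Z")
noticed = parse_utc("2019-04-26T11:23:00Z")

# Time during which renewal was already failing but the certificate was still valid.
renew_failing_before_expiry = not_after - first_renew_failure
# Time between certificate expiry and the moment the failure was noticed.
expired_when_noticed = noticed - not_after

print(renew_failing_before_expiry)  # 29 days, 22:25:24
print(expired_when_noticed)         # 22 days, 0:06:30
```

Per the timeline as written, renewals had been failing for roughly 30 days before the Not After date, and the certificate had been expired for about 22 days when it was noticed.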

What went wrong

  1. alerts were ignored for a long time by whoever had access to them (see timeline)

  2. we have been having issues with certbot renewal for quite some time; this was known, but we did not react, probably because a testing system is not considered very relevant
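
One way to make item 2 harder to overlook is to classify a certificate by how close it is to its Not After date, so that a failing renewal shows up as its own state before the certificate actually expires. The following is only a sketch under assumptions: `classify_certificate` and the state names are hypothetical, and the 30-day window is assumed to match certbot's default renewal threshold. It is not a description of the existing monitoring.

```python
from datetime import datetime, timezone

# Assumption: renewal is expected once fewer than 30 days of validity remain.
RENEW_BEFORE_DAYS = 30


def classify_certificate(not_after, now, renew_before_days=RENEW_BEFORE_DAYS):
    """Return "ok", "renewal-overdue" or "expired" (hypothetical state names)."""
    remaining_days = (not_after - now).total_seconds() / 86400
    if remaining_days <= 0:
        return "expired"
    if remaining_days < renew_before_days:
        # Renewal should already have happened, so it is probably failing.
        return "renewal-overdue"
    return "ok"


def utc(year, month, day, hour=0, minute=0, second=0):
    """Small helper to build timezone-aware UTC datetimes."""
    return datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)


# Not After taken from the timeline in this issue.
NOT_AFTER = utc(2019, 4, 4, 11, 16, 30)

print(classify_certificate(NOT_AFTER, utc(2019, 2, 1)))              # ok
print(classify_certificate(NOT_AFTER, utc(2019, 3, 5, 12, 51, 6)))   # renewal-overdue
print(classify_certificate(NOT_AFTER, utc(2019, 4, 26, 11, 23, 0)))  # expired
```

With the dates from the timeline, the certificate would have been in the "renewal-overdue" state from the first renew failure onwards, about a month before it reached "expired".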

How to fix it

@hellais said:

I would rerun the playbook on just the test instance and check if that fixes it.
If not, I would log in to the machine, check the logs to see what is wrong, and adjust the playbook accordingly.

What could be done to prevent relapse and decrease impact

Currently not clear to @bassosimone.

Remediation: attempt 1

With:

diff --git a/ansible/deploy-ooni-registry.yml b/ansible/deploy-ooni-registry.yml
index 1216485..d81aaff 100644
--- a/ansible/deploy-ooni-registry.yml
+++ b/ansible/deploy-ooni-registry.yml
@@ -1,6 +1,7 @@
 - hosts: registry.proteus.test.ooni.io
   roles:
     - role: letsencrypt
+      tags: letsencrypt
       letsencrypt_domains: ["registry.proteus.test.ooni.io"]
       letsencrypt_nginx: yes
     - role: ooni-registry
@@ -13,6 +14,7 @@
 - hosts: registry.proteus.ooni.io
   roles:
     - role: letsencrypt
+      tags: letsencrypt
       letsencrypt_domains: ["registry.proteus.ooni.io"]
       letsencrypt_nginx: yes
     - role: ooni-registry

Run:

./play -l registry.proteus.test.ooni.io -t letsencrypt ./deploy-ooni-registry.yml

Results:

PLAY [registry.proteus.test.ooni.io] **********************************************************************************

TASK [Gathering Facts] ************************************************************************************************
ok: [registry.proteus.test.ooni.io]

TASK [letsencrypt : Add Debian stable backports repository] ***********************************************************
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: , E:Some index files failed to download. They have been ignored, or old ones used instead.
fatal: [registry.proteus.test.ooni.io]: FAILED! => {"changed": false, "module_stderr": "Traceback (most recent call last):\n  File \"/tmp/ansible_S41dxW/ansible_module_apt_repository.py\", line 556, in <module>\n    main()\n  File \"/tmp/ansible_S41dxW/ansible_module_apt_repository.py\", line 544, in main\n    cache.update()\n  File \"/usr/lib/python2.7/dist-packages/apt/cache.py\", line 443, in update\n    raise FetchFailedException(e)\napt.cache.FetchFailedException: W:Failed to fetch http://httpredir.debian.org/debian/dists/jessie-backports/main/binary-amd64/Packages  404  Not Found [IP: 151.101.196.204 80]\n, E:Some index files failed to download. They have been ignored, or old ones used instead.\n", "module_stdout": "", "msg": "MODULE FAILURE", "rc": 1}
	to retry, use: --limit @/Users/sbasso/src/github.com/ooni/sysadmin/ansible/deploy-ooni-registry.retry

PLAY RECAP ************************************************************************************************************
registry.proteus.test.ooni.io : ok=1    changed=0    unreachable=0    failed=1

Result: FAILURE 😠

Remediation: attempt 2

See 6ceeee7

Result: SUCCESS 🎉

Updated the title to reflect that the incident is now solved and removed the bug tag.