ooni/sysadmin

API down for ~10h

hellais opened this issue

Impact: OONI Explorer and the API were not working for 10 hours

Detection: alert in #ooni-bots and human alert in #ooni-dev

Timeline (UTC+3):
[00:36] AlertManager APP

[FIRING] amsapi.ooni.nu:9100 has 0% RAM available
[RESOLVED] amsapi.ooni.nu:9100 has 0% RAM available
[RESOLVED] http://nkvphnp3p6agi5qq.onion/bouncer/net-tests endpoint down
[FIRING] amsapi.ooni.nu:9100 had 1 OOMs in 5m
[FIRING] 6 task `collector_sensor_hkgcollectora` failures last week
Last `collector_sensor_hkgcollectora` failure 1d 22h ago.
[RESOLVED] amsapi.ooni.nu:9100 had 1 OOMs in 5m
[FIRING] api.ooni.io:443 is not `up`
[FIRING] https://measurements.ooni.torproject.org/api/v1/files endpoint down

[03:06] sukhbir@ooni-dev
Explorer is down?

[11:36] hellais@ooni-bots
Seems like API is down
[11:42] hellais@ooni-bots
runs `./play deploy-measurements.yml -t measurements` on top of dirty `3a875f1` (master)
[11:45] hellais@ooni-bots
Hum:

ImportError: /usr/local/lib/python3.5/site-packages/misaka/_hoedown.abi3.so: failed to map segment from shared object: Cannot allocate memory

[11:55] hellais@ooni-bots
Ok it should be back online now
I was not able to restart the service trivially (or redeploy it cleanly), because it seemed like Docker was in an inconsistent state as well
I was getting this error:

fatal: [amsapi.ooni.nu]: FAILED! => {"changed": false, "msg": "Error starting container 17ee5ad863cd026155e506280988c5271b4c4451b8fb170f4f50eb67f3f1352a: 500 Server Error: Internal Server Error (\"{\"message\":\"endpoint with name oomsm-web already exists in network msm\"}\")"}

Or this error:

fatal: [amsapi.ooni.nu]: FAILED! => {"changed": false, "msg": "Error removing container e9206d3be1d7c572f99d6d919b21da3597e93771583f3c07f0d95f4a931ac521: 409 Client Error: Conflict (\"{\"message\":\"You cannot remove a running container e9206d3be1d7c572f99d6d919b21da3597e93771583f3c07f0d95f4a931ac521. Stop the container before attempting removal or force remove\"}\")"}

Both errors came up when running the playbook.
To fix it, I ended up restarting the Docker service.
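
For reference, a minimal sketch of the recovery sequence described above, assuming Docker is managed by systemd on the amsapi host and the playbook is run from the usual sysadmin checkout (exact paths and time windows are assumptions):

```
# Check the kernel log for OOM-killer activity around the outage window
journalctl -k --since "12 hours ago" | grep -i "out of memory"

# Docker itself was in an inconsistent state, so restart the daemon
sudo systemctl restart docker

# Then redeploy the API/measurements service with the same playbook as above
./play deploy-measurements.yml -t measurements
```

Restarting the daemon appears to have cleared the stale `oomsm-web` endpoint in the `msm` network, after which the playbook ran cleanly.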

What went well:

  • The alert was accurate and timely
  • We had been seeing low memory alerts on amsapi for a while

What went wrong:

  • We did not pay much attention to the low memory alerts on amsapi for weeks
  • The amsapi host is a single point of failure

What could be done to prevent relapse and decrease impact:

  • Increase the memory of the VM hosting amsapi, or evaluate how to reduce the memory footprint of the app (see the sketch after this list)
  • Have another host offering the API service behind a reverse proxy
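
For the first item, a rough sketch of how the memory footprint could be evaluated on the current host before resizing the VM (purely illustrative; the commands are standard procps/Docker tooling and assume shell access to amsapi.ooni.nu):

```
# Overall headroom on the VM
free -m

# Point-in-time memory usage per container
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"

# Largest processes by resident memory, i.e. what the OOM killer is likely to target
ps -eo pid,rss,comm --sort=-rss | head -n 10
```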