API down for ~10h
hellais opened this issue · 1 comment
Impact: OONI Explorer and the API were not working for 10 hours
Detection: alert in #ooni-bots and human alert in #ooni-dev
Timeline UTC+3:
[00:36] AlertManager APP
[FIRING] amsapi.ooni.nu:9100 has 0% RAM available
[RESOLVED] amsapi.ooni.nu:9100 has 0% RAM available
[RESOLVED] http://nkvphnp3p6agi5qq.onion/bouncer/net-tests endpoint down
[FIRING] amsapi.ooni.nu:9100 had 1 OOMs in 5m
[FIRING] 6 task `collector_sensor_hkgcollectora` failures last week
Last `collector_sensor_hkgcollectora` failure 1d 22h ago.
[RESOLVED] amsapi.ooni.nu:9100 had 1 OOMs in 5m
[FIRING] api.ooni.io:443 is not `up`
[FIRING] https://measurements.ooni.torproject.org/api/v1/files endpoint down
[03:06] sukhbir@ooni-dev
Explorer is down?
[11:36] hellais@ooni-bots
Seems like API is down
[11:42] hellais@ooni-bots
runs ./play deploy-measurements.yml -t measurements
on top of dirty 3a875f1 (master)
[11:45] hellais@ooni-bots
Hum:
ImportError: /usr/local/lib/python3.5/site-packages/misaka/_hoedown.abi3.so: failed to map segment from shared object: Cannot allocate memory
[11:55] hellais@ooni-bots
Ok it should be back online now
I was not able to restart the service trivially (or redeploy it cleanly), because docker itself seemed to be in an inconsistent state as well.
I was getting this error:
fatal: [amsapi.ooni.nu]: FAILED! => {"changed": false, "msg": "Error starting container 17ee5ad863cd026155e506280988c5271b4c4451b8fb170f4f50eb67f3f1352a: 500 Server Error: Internal Server Error (\"{\"message\":\"endpoint with name oomsm-web already exists in network msm\"}\")"}
Or this error:
fatal: [amsapi.ooni.nu]: FAILED! => {"changed": false, "msg": "Error removing container e9206d3be1d7c572f99d6d919b21da3597e93771583f3c07f0d95f4a931ac521: 409 Client Error: Conflict (\"{\"message\":\"You cannot remove a running container e9206d3be1d7c572f99d6d919b21da3597e93771583f3c07f0d95f4a931ac521. Stop the container before attempting removal or force remove\"}\")"}
when running the playbook.
To fix it, I ended up restarting the docker service.
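For future reference, a cleanup along these lines might have avoided the full daemon restart. This is a hedged sketch, not what was actually run: the network (`msm`), endpoint (`oomsm-web`), and container IDs are taken from the error messages above, and whether a targeted cleanup would have worked on this particular inconsistent state is untested.

```shell
# "endpoint with name oomsm-web already exists in network msm":
# force-remove the stale endpoint registration from the network.
docker network disconnect --force msm oomsm-web

# "You cannot remove a running container ... 409 Conflict":
# stop the container before removing it (IDs from the logs above).
docker stop e9206d3be1d7c572f99d6d919b21da3597e93771583f3c07f0d95f4a931ac521
docker rm   e9206d3be1d7c572f99d6d919b21da3597e93771583f3c07f0d95f4a931ac521

# If the daemon state is still inconsistent, restart it as a last resort,
# which is what ended up fixing things here.
systemctl restart docker
```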
What went well:
- The alert was accurate and timely
- We had been seeing low memory alerts on amsapi for a while
What went wrong:
- We had not paid much attention to the low memory alerts on amsapi for weeks
- The amsapi host is a single point of failure
What could be done to prevent relapse and decrease impact:
- Increase the memory of the VM hosting amsapi, or evaluate how to reduce the memory footprint of the app
- Have another host offering the API service behind a reverse proxy
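The second remediation item could look roughly like the nginx sketch below: two backends in one upstream pool, so losing amsapi no longer takes the API down. The hostnames, ports, and the existence of a second host are all assumptions for illustration.

```nginx
# Hypothetical sketch: API served from a pool instead of a single host.
# backupapi.ooni.nu is a placeholder; no such host exists yet.
upstream ooni_api {
    server amsapi.ooni.nu:443;
    server backupapi.ooni.nu:443 backup;  # used only if the primary fails
}

server {
    listen 443 ssl;
    server_name api.ooni.io;

    location / {
        proxy_pass https://ooni_api;
    }
}
```

With `backup` on the second server, traffic still goes to amsapi normally, but an outage like this one would fail over instead of returning errors for hours.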