ooni/sysadmin

Incident: blocked pipeline on 2019-12-10

FedericoCeratto opened this issue · 1 comment

Timeline (times in UTC+1):
federico 13:38
oh
https://mon.ooni.nu/prometheus/graph?g0.range_input=12h&g0.expr=8%20*%20node_filesystem_avail%20-%207%20*%20(node_filesystem_avail%20OFFSET%201d)%20%3C%200&g0.tab=0
false alarm?

hellais 16:05
Probably related to an increased rate of msmts
@federico it actually is true: https://mon.ooni.nu/prometheus/graph?g0.range_input=1d&g0.expr=(8%20*%20node_filesystem_avail%7Binstance%3D%22mia-ps2.ooni.nu%3A9100%22%2Cmountpoint%3D%22%2F%22%7D-%207%20*%20(node_filesystem_avail%7Binstance%3D%22mia-ps2.ooni.nu%3A9100%22%7D%20offset%201d))&g0.tab=0
Hum, according to: https://mon.ooni.nu/grafana/d/AE8tFfxWk/collectors-disk-activity?orgId=1&from=now-2d&to=now
The file count on mia-ps2 is going up

18:33
hellais runs ./play deploy-pipeline.yml on top of clean 06b8e01 (master)

This deploy implements the fix: ba0392d
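
For reference, the alert expression in the Prometheus links above, 8 * node_filesystem_avail - 7 * (node_filesystem_avail offset 1d) < 0, is a linear extrapolation: it fires when, at the previous day's rate of disk consumption, the filesystem would run out of free space within roughly seven days. A minimal sketch of the same arithmetic (the function and variable names are illustrative, not part of the alert):

    # Rough reading of the Prometheus alert above (names are illustrative).
    # avail_now and avail_1d_ago correspond to node_filesystem_avail now and
    # with "offset 1d"; the check is equivalent to 8*avail_now - 7*avail_1d_ago < 0.
    def disk_full_within_a_week(avail_now: float, avail_1d_ago: float) -> bool:
        daily_decrease = avail_1d_ago - avail_now
        projected_avail_in_7_days = avail_now - 7 * daily_decrease
        return projected_avail_in_7_days < 0

    # Example: 100 GiB free yesterday, 80 GiB free today means roughly 20 GiB
    # consumed per day, so the disk would fill in about 4 days and the alert fires.
    print(disk_full_within_a_week(80e9, 100e9))  # True
    print(disk_full_within_a_week(99e9, 100e9))  # False, over a week of headroom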

The root cause of the issue turned out to be that the newly deployed ams-ps2 collector did not have all of the required directories created. As a result, when the daily cron job that renames the files (/srv/collector/bin/daily-tasks.sh) found an empty file, it was not able to move it to the correct destination.
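
As an illustration of that failure mode, here is a minimal, hypothetical sketch (in Python, not the actual daily-tasks.sh) of why moving a file into a destination directory that was never created fails, and of the defensive fix of creating the directory first:

    import os
    import shutil

    def archive_report(src: str, dest_dir: str) -> None:
        # Defensive fix: create the destination directory before moving, so a
        # freshly deployed collector cannot break the daily file rotation.
        os.makedirs(dest_dir, exist_ok=True)
        shutil.move(src, os.path.join(dest_dir, os.path.basename(src)))

    # Without the makedirs() call, shutil.move() raises FileNotFoundError when
    # dest_dir does not exist, roughly the failure described above.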

What went well:

  • Alerting was useful and allowed us to spot the issue

What went wrong:

  • The incident had actually been going on since the 3rd of December (when this collector was deployed), and we did not notice it until the disk space problem surfaced

What we should do to prevent relapse:

I updated the above comment with a timeline and next steps.

I am closing this incident issue as we have documented the next steps as issues.