ooni/sysadmin

Monitoring of hosts was down for several weeks

hellais opened this issue · 5 comments

Impact:

  • Metrics from the 14th of August onwards were not scraped
  • Critical alerts may have been lost for the affected time period

Timeline (Sept 2nd):

  • 12:37 CET [hellais] It looks like we are out of disk space on amsmetadb:
    /dev/mapper/amsmetadb-plpsql 1000G 1000G 9.4M 100% /srv/pl-psql
    I don’t know why we did not get an alert for it (see the filesystem alert sketch after the timeline).
  • 15:54 CET For some reason it seems we are only scraping brie.darkk.net.ru with Prometheus and no other machine.
  • 15:56 CET hellais runs ./play deploy-prometheus.yml on top of dirty 4e139af (master).
  • 15:59 CET Scraping is now restored, though it seems we lost several weeks of metrics.
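
A filesystem alert of roughly this shape would have caught the full disk on amsmetadb. This is only a sketch, assuming node exporter is exposing the standard node_filesystem_* metrics; the threshold and duration are placeholders to be tuned:

- alert: FilesystemAlmostFull
  # Fire when less than 10% of a mounted filesystem is still available.
  # node_filesystem_avail_bytes and node_filesystem_size_bytes are standard
  # node exporter metrics; 10% and 30m are placeholder values.
  expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.instance }} {{ $labels.mountpoint }} is almost full"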

What went well:

  • We eventually noticed

What went wrong:

  • We had no alert on missing node exporter metrics

What we should do to prevent this:

  • Add an alert that fires when fewer node exporter targets than expected are being scraped (see the sketch below)
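
A minimal sketch of such an alert, assuming the node exporter scrape job is called "node" (the job name and the threshold of 10 are placeholders):

- alert: TooFewNodeExporterTargets
  # "up" is 1 when a scrape succeeds and 0 when it fails, so summing it per
  # job counts the targets currently being scraped successfully. If targets
  # disappear from the generated config entirely, their "up" series vanish
  # and the sum drops as well.
  expr: sum(up{job="node"}) < 10
  for: 15m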

There's a metric, prometheus_sd_discovered_targets, that will tell you how many targets are discovered per job.

You could have an alert like this:

- alert: FewerDiscoveredTargets
  expr: (prometheus_sd_discovered_targets - prometheus_sd_discovered_targets offset 1d) < -1

Could be noisy if you're intentionally removing targets.

Ah that's useful, thanks for the heads up.

I think that even just setting that to expr: (prometheus_sd_discovered_targets < 3) is good enough. Since we generate the monitoring rules from the ansible inventory, I think the issue arose because the template did not properly resolve the variable, so just checking for a sane lower bound of targets is good.
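
Spelled out as a full rule, that lower-bound check would look roughly like this (the rule name, threshold, and "for" duration are placeholders):

- alert: TooFewDiscoveredTargets
  # Fires when any scrape job discovers fewer than 3 targets, e.g. because
  # the templated target list in the generated Prometheus config is broken.
  # Jobs that legitimately have fewer than 3 targets would need a higher
  # threshold or an exclusion.
  expr: prometheus_sd_discovered_targets < 3
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "A scrape job only discovers {{ $value }} targets"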

darkk commented

We had no alert on missing node exporter metrics

FYI https://openobservatory.slack.com/archives/C38EJ0CET/p1565616219084100 alerted on the overall number of time series dropping from 42k to 5k on the 12th of August. It looks relevant to this incident.

The alert was added after #189.


[FIRING] Lots of `scrape_samples_scraped` lost
Now ~ 5.658k, 24h ago ~ 42.45k.
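
The exact rule from #189 is not reproduced here, but an alert of that kind can be written roughly as follows (a sketch; the 24h window and 50% threshold are assumptions):

- alert: LotsOfScrapeSamplesLost
  # Compare the total number of samples collected per scrape against the
  # value 24 hours ago; a large drop means many targets stopped reporting.
  expr: sum(scrape_samples_scraped) < 0.5 * sum(scrape_samples_scraped offset 24h)
  for: 1h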

Good call, I wrongly attributed this alert to having removed some hosts from monitoring due to consolidation of machines, but you are right, this was probably it.