BaritoLog/prometheus-runbooks

Consul nodes are down without any alert


Consul nodes are down. Nodes can't communicate with each other, causing log loss. Monitoring is possible because, when Consul nodes are down, the consul_catalog_service_node_healthy metric is no longer reported by the Consul exporter.

Alerting can be based on the Consul node metrics going missing, but there is a problem: several factors can cause consul_catalog_service_node_healthy not to be reported. The factors known so far are:

  • Consul nodes are down. (what we want to monitor)
  • Consul exporter is down. (beneficial to monitor, although it isn't really our aim)
  • Timeout when the Consul exporter contacts Consul to gather metrics. (false positive)
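The factors above can be partly told apart with separate rules. The following is a minimal sketch of a Prometheus rule file, not a tested configuration. It assumes the exporter is scraped under a job named consul-exporter (hypothetical) and that the exporter exposes consul_up, its gauge for whether the last query to Consul succeeded. The group name, alert names, durations, and severities are illustrative assumptions.

```yaml
groups:
  - name: consul-exporter-diagnostics   # hypothetical group name
    rules:
      # Factor 2: Prometheus cannot scrape the exporter at all.
      - alert: ConsulExporterDown
        expr: up{job="consul-exporter"} == 0   # job label is an assumption
        for: 2m                                # assumed duration
        labels:
          severity: warning
        annotations:
          summary: "Consul exporter is not being scraped"

      # Factors 1 and 3: the exporter is up but its query to Consul failed.
      # This alone cannot tell a dead Consul from a timeout; only the
      # duration of the failure hints at which one it is.
      - alert: ConsulExporterCannotQueryConsul
        expr: consul_up == 0
        for: 2m                                # assumed duration
        labels:
          severity: warning
        annotations:
          summary: "Consul exporter failed to query Consul"
```

Keeping the exporter-down case in its own alert means a missing consul_catalog_service_node_healthy series is not automatically blamed on Consul itself.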

Suggested Action

The Consul exporter's timeout configuration can be increased to minimize false positives; alerting can then be implemented.
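One possible shape for the alert itself is sketched below, under the same assumptions as any Prometheus rule file for this setup: the job name consul-exporter is hypothetical, and the 5m duration and severity are placeholders to be tuned. It uses the PromQL absent() function to fire when no consul_catalog_service_node_healthy series exists, and restricts the alert to moments when the exporter itself is being scraped successfully.

```yaml
groups:
  - name: consul-node-availability      # hypothetical group name
    rules:
      - alert: ConsulNodeMetricsMissing
        # absent() returns a single series with value 1 when no matching
        # series exists. "and on()" keeps that result only while at least
        # one exporter target is up, so an exporter outage does not fire
        # this alert.
        expr: |
          absent(consul_catalog_service_node_healthy)
          and on() (up{job="consul-exporter"} == 1)
        # Require the condition to hold continuously, so a single slow
        # scrape or short timeout does not page anyone (assumed value).
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "consul_catalog_service_node_healthy is missing; Consul nodes may be down"
```

The for clause complements a longer exporter timeout: the timeout reduces how often a query fails, while the hold duration absorbs the failures that still happen.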

Other false positives can be identified and handled once the alert is in place.