Consul nodes are down without any alert
Closed this issue · 0 comments
nieltg commented
Consul nodes are down. Nodes can't communicate with each other, which causes log loss. Monitoring is possible because when Consul nodes are down, the `consul_catalog_service_node_healthy` metrics are not reported by the Consul exporter.
Alerting can be based on the Consul node metrics being missing, but there is a problem: several factors can cause the `consul_catalog_service_node_healthy` metrics not to be reported. The factors known so far are:
- Consul nodes are down. (what we want to monitor)
- Consul exporter is down. (beneficial to monitor, although it isn't really our aim)
- Timeout when the Consul exporter contacts Consul to gather metrics. (false positive)
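
The three factors above can be told apart, at least partially, by looking at two other signals next to the missing metric. The following is a minimal sketch, not the project's actual alerting logic. It assumes the caller has already fetched Prometheus' own `up` value for the exporter scrape target and the exporter's `consul_up` gauge; the `Cause` enum and the `classify` function are hypothetical names used only for illustration.

```python
from enum import Enum
from typing import Optional


class Cause(Enum):
    OK = "metric present"
    EXPORTER_DOWN = "Consul exporter is down"
    CONSUL_UNREACHABLE = "Consul is down or the exporter's query timed out"
    UNKNOWN = "metric missing for an unknown reason"


def classify(exporter_up: Optional[float],
             consul_up: Optional[float],
             healthy_series_count: int) -> Cause:
    """Guess why consul_catalog_service_node_healthy is missing.

    exporter_up: value of Prometheus' `up` for the exporter target
                 (None if the series does not exist).
    consul_up: value of the exporter's `consul_up` gauge
               (None if the series does not exist).
    healthy_series_count: number of consul_catalog_service_node_healthy
                          series currently returned by the query.
    """
    if healthy_series_count > 0:
        return Cause.OK
    # If the scrape itself failed, no exporter metric can be trusted.
    if exporter_up is None or exporter_up == 0:
        return Cause.EXPORTER_DOWN
    # The exporter answered, but its last query to Consul failed.
    # A down node and a timeout look the same from a single sample.
    if consul_up is None or consul_up == 0:
        return Cause.CONSUL_UNREACHABLE
    return Cause.UNKNOWN
```

Note that a single sample cannot separate "Consul is down" from "the query timed out"; both end up in `CONSUL_UNREACHABLE`, which is why the timeout case remains a false-positive source.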
Suggested Action
The Consul exporter's timeout configuration can be increased to minimize false positives, then alerting can be implemented.
Other false positives can be identified and handled after the alert is implemented.
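
Besides raising the timeout, transient timeouts can also be tolerated by firing only when the metric stays missing for several consecutive checks, similar in spirit to the `for:` clause of a Prometheus alerting rule. The sketch below illustrates that idea only; `should_alert` and the default of three consecutive checks are assumptions, not values taken from this issue.

```python
from typing import List


def should_alert(missing_history: List[bool],
                 required_consecutive: int = 3) -> bool:
    """Return True only if the metric was missing in each of the
    last `required_consecutive` checks.

    missing_history: oldest-to-newest list, True where the
                     consul_catalog_service_node_healthy metric
                     was missing at that check.
    """
    if required_consecutive <= 0:
        raise ValueError("required_consecutive must be positive")
    # Not enough observations yet: do not alert.
    if len(missing_history) < required_consecutive:
        return False
    # A single successful check inside the window resets the alert.
    return all(missing_history[-required_consecutive:])
```

With this approach, one timed-out query between two successful ones does not trigger the alert, while a node that stays down does.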