m-lab/prometheus-support

Alert `MachineRunningWithoutK8sNode` does not catch everything

nkinkade opened this issue · 0 comments

We recently had a situation where a site, MAA02, went down due to some sort of transit provider issue. The machines were running the whole time, but were inaccessible. The issue went on for around 3 weeks. Meanwhile, kubernetes had all MAA02 nodes marked as NotReady and all pods marked as down. This state can trigger alerts and can block RollingUpdates. To get around it, the nodes were deleted from k8s. At some point the network issue was corrected, and what we were left with was "zombie" nodes: running nodes that k8s does not know about. We have an alert to detect these "zombie" nodes, but the alert doesn't catch everything. It looks for all nodes where the blackbox_exporter thinks ndt_ssl is up, yet k8s doesn't know about the node. However, in the case of MAA02, NDT was not running on any of the nodes. Apparently the kubelet had killed all the containers for one reason or another, so the ndt_ssl check never succeeded and the alert never fired.

We should modify the alert to look for nodes where some core machine service, such as SSH, is up, yet k8s does not know about the node.
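
A rough sketch of what the modified rule might look like follows. It is only an illustration, not the existing rule: `probe_success` is the standard blackbox_exporter result metric and `kube_node_info` comes from kube-state-metrics, but the `service="ssh"` selector, the `machine` label, the assumption that the `node` label holds the same hostname, and the `for` and `severity` values are all guesses that would need to be checked against our actual scrape configs.

```yaml
groups:
  - name: zombie-nodes-sketch        # hypothetical group name
    rules:
      - alert: MachineRunningWithoutK8sNode
        # Assumption: SSH probes are labeled service="ssh" and carry a
        # `machine` label holding the machine hostname.
        # Assumption: kube_node_info's `node` label holds that same hostname,
        # so it is copied into a `machine` label to make the two sides match.
        # The expression keeps machines whose SSH probe succeeds and drops
        # every machine that has a corresponding k8s node.
        expr: |
          probe_success{service="ssh"} == 1
            unless on(machine)
          label_replace(kube_node_info, "machine", "$1", "node", "(.*)")
        for: 1h                      # placeholder duration
        labels:
          severity: ticket           # placeholder severity
        annotations:
          summary: "SSH is up on {{ $labels.machine }}, but k8s does not know about the node."
```

Keying on SSH rather than ndt_ssl means the alert no longer depends on any container being alive, which is the case that MAA02 slipped through.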