k3s-io/cluster-api-k3s

Node deletion failure causes the subsequent node operation to fail

mogliang opened this issue · 1 comment

In a KCP rolling-update process, when node deletion times out, CAPI logs an error and proceeds to delete the Machine anyway.

E1121 02:20:51.361652       1 machine_controller.go:461] "Timed out deleting node" err="error deleting node mc2-control-plane-wvw2h: Delete \"https://mc2.mc2.akshybrid.io:6443/api/v1/nodes/mc2-control-plane-wvw2h?timeout=10s\": context deadline exceeded" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="cluster-mc2/mc2-control-plane-mv7zc" namespace="cluster-mc2" name="mc2-control-plane-mv7zc" reconcileID=b941098f-e4aa-463c-b390-7d15138a0a03 KThreesControlPlane="cluster-mc2/mc2-control-plane" Cluster="cluster-mc2/mc2" Node="mc2-control-plane-wvw2h"
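
For context, CAPI bounds node deletion with the Machine spec's nodeDeletionTimeout field (documented default 10s; a zero duration means retry indefinitely). The following is a simplified sketch of that check, not the exact upstream code:

```go
package controllers

import (
	"time"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// defaultNodeDeletionTimeout mirrors CAPI's documented default for
// Machine.Spec.NodeDeletionTimeout when the field is unset.
const defaultNodeDeletionTimeout = 10 * time.Second

// nodeDeleteTimeoutExceeded sketches the semantics behind the
// "Timed out deleting node" log above: once the timeout window measured
// from the Machine's deletion timestamp has passed, the controller stops
// retrying node deletion and removes the Machine anyway, which can leave
// a stale member behind in the etcd cluster.
func nodeDeleteTimeoutExceeded(machine *clusterv1.Machine) bool {
	timeout := defaultNodeDeletionTimeout
	if machine.Spec.NodeDeletionTimeout != nil {
		timeout = machine.Spec.NodeDeletionTimeout.Duration
	}
	if timeout == 0 {
		// A zero duration means node deletion is retried indefinitely.
		return false
	}
	// DeletionTimestamp is set once the Machine is marked for deletion.
	return machine.DeletionTimestamp.Add(timeout).Before(time.Now())
}
```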

However, since the cluster is left in an unhealthy state, the subsequent add-node operation gets blocked:

Nov 21 02:49:37 mc2-control-plane-h9kvp k3s[5875]: time="2023-11-21T02:49:37Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:6443/v1-k3s/readyz: 500 Internal Server Error"
Nov 21 02:49:37 mc2-control-plane-h9kvp k3s[5875]: time="2023-11-21T02:49:37Z" level=info msg="Adding member mc2-control-plane-h9kvp-b513763a=https://192.168.0.114:2380 to etcd cluster [mc2-control-plane-psph2-00576310=https://192.168.0.111:2380 mc2-control-plane-w>
Nov 21 02:49:37 mc2-control-plane-h9kvp k3s[5875]: {"level":"warn","ts":"2023-11-21T02:49:37.542Z","logger":"etcd-client","caller":"v3@v3.5.7-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00064e380/192.168.0.111:2>
Nov 21 02:49:37 mc2-control-plane-h9kvp k3s[5875]: time="2023-11-21T02:49:37Z" level=info msg="Waiting for other members to finish joining etcd cluster: etcdserver: unhealthy cluster"

To fix the issue, the k3s control plane provider should be able to pass through the Machine spec's nodeDeletionTimeout property.
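
As a rough sketch of that pass-through (the kcpNodeDeletionTimeout parameter here stands in for a control-plane-level setting the provider does not expose yet, while Machine.Spec.NodeDeletionTimeout is the existing CAPI field it needs to reach):

```go
package controllers

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// newControlPlaneMachine sketches the proposed plumbing: the value from the
// (hypothetical) control-plane-level setting is copied onto each generated
// Machine, so CAPI retries node deletion for as long as the operator
// configured instead of falling back to the 10s default.
func newControlPlaneMachine(name, namespace, clusterName string, kcpNodeDeletionTimeout *metav1.Duration) *clusterv1.Machine {
	return &clusterv1.Machine{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: namespace,
		},
		Spec: clusterv1.MachineSpec{
			ClusterName: clusterName,
			// nil keeps CAPI's default behavior; a non-nil value overrides it.
			NodeDeletionTimeout: kcpNodeDeletionTimeout,
		},
	}
}
```

This would mirror KubeadmControlPlane, which already exposes spec.machineTemplate.nodeDeletionTimeout for the same purpose.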

Linked to #62.