Bug: Unable to Remove Autoscaling Node from k8s Cluster with >10 Autoscaling Nodes
Current Behaviour
Claudie failed to remove the node from the k8s cluster on a downscale request when there were more than 10 nodes in the autoscaling nodepool.
After a downscaling request, the node named compute01-ccx23-auto-fy7ww3o-10 was removed from Longhorn and its VM was destroyed by the terraformer, but the node wasn't removed from the k8s cluster. The only error log I could find was in the cluster-autoscaler.
I0424 08:09:53.735469 1 actuator.go:161] Scale-down: removing empty node "compute01-ccx23-auto-fy7ww3o-10"
E0424 08:09:58.749463 1 actuator.go:423] Scale-down: couldn't delete empty node, , status error: failed to delete compute01-ccx23-auto-fy7ww3o-10: rpc error: code = Unknown desc = failed to update nodepool compute01-ccx23-auto-fy7ww3o : error while updating the state in the Claudie : rpc error: code = Unknown desc = the project default-wox01 is currently in the build stage
Besides that, there were some odd info logs in kuber. When kuber deletes a node, the logs normally look as follows.
2024-04-24T07:45:23Z INF Deleting node <node-name> from nodes.longhorn.io from cluster cluster=wox01-cluster-qy5w5zl module=kuber
2024-04-24T07:45:24Z INF Deleting node <node-name> from k8s cluster cluster=wox01-cluster-qy5w5zl module=kuber
Below are the logs for the deletion of compute01-ccx23-auto-fy7ww3o-10. A different node name appears in each log, even though it should be the same in both. For node compute01-ccx23-auto-fy7ww3o-10 the log message matched the cluster state (the node was indeed deleted from Longhorn), but compute01-ccx23-auto-fy7ww3o-1 was still part of the k8s cluster and had running workloads.
2024-04-24T07:45:23Z INF Deleting node compute01-ccx23-auto-fy7ww3o-10 from nodes.longhorn.io from cluster cluster=wox01-cluster-qy5w5zl module=kuber
2024-04-24T07:45:24Z INF Deleting node compute01-ccx23-auto-fy7ww3o-1 from k8s cluster cluster=wox01-cluster-qy5w5zl module=kuber
There's probably some unexpected behavior when the numeric suffix of a node in an autoscaling nodepool grows beyond a single digit.
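To illustrate the suspicion (this is not Claudie code, just a standalone sketch using the node names from the logs above): a substring or prefix comparison cannot tell the two names apart, because the name ending in -1 is a prefix of the name ending in -10.

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// Node names taken from the logs above.
	requested := "compute01-ccx23-auto-fy7ww3o-10"
	other := "compute01-ccx23-auto-fy7ww3o-1"

	// Both checks report a match, because "...-1" is a prefix of "...-10".
	fmt.Println(strings.Contains(requested, other))  // true
	fmt.Println(strings.HasPrefix(requested, other)) // true

	// Only an exact comparison distinguishes the two nodes.
	fmt.Println(requested == other) // false
}
```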
Expected Behaviour
Claudie should successfully remove the node from the autoscaling nodepool on a downscaling request.
Steps To Reproduce
I haven't tried it myself, but the following steps should work.
- Create a k8s cluster with an autoscaling nodepool.
- Deploy enough workload to have at least 10 nodes in the autoscaling nodepool.
- Start downscaling the autoscaling nodepool by deleting the workload. The problem should occur when Claudie tries to remove a node whose number suffix consists of two digits. There is a chance that it will only occur when Claudie tries to remove the node with the number suffix 10 while the node with the number suffix 1 is still up.
Anything else to note
A culprit might be a substring match in claudie/internal/utils/cluster.go, line 136 at commit 3bab709.
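I haven't dug into what that line does exactly, so the following is only a minimal sketch of how such a lookup could go wrong and how an exact comparison would avoid it; the function names and the matching direction below are assumptions, not the actual code from cluster.go.

```go
package main

import (
	"fmt"
	"strings"
)

// findNodeBySubstring is a hypothetical stand-in for a substring-based
// lookup: it returns the first candidate whose name is contained in the
// requested node name.
func findNodeBySubstring(candidates []string, requested string) string {
	for _, c := range candidates {
		if strings.Contains(requested, c) {
			return c
		}
	}
	return ""
}

// findNodeExact compares full names, so "...-1" can no longer shadow "...-10".
func findNodeExact(candidates []string, requested string) string {
	for _, c := range candidates {
		if c == requested {
			return c
		}
	}
	return ""
}

func main() {
	candidates := []string{
		"compute01-ccx23-auto-fy7ww3o-1",
		"compute01-ccx23-auto-fy7ww3o-10",
	}
	requested := "compute01-ccx23-auto-fy7ww3o-10"

	// The substring lookup returns "...-1" even though "...-10" was requested,
	// which would match the mixed-up node names seen in the kuber logs.
	fmt.Println(findNodeBySubstring(candidates, requested))
	// The exact lookup returns the intended node.
	fmt.Println(findNodeExact(candidates, requested))
}
```

If the real code matches node names along these lines, comparing the full node name (or parsing the numeric suffix as an integer) instead of a substring should make downscaling work for two-digit suffixes.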