berops/claudie

Bug: Unable to Remove Autoscaling Node from k8s Cluster with >10 Autoscaling Nodes

JKBGIT1 opened this issue

Current Behaviour

Claudie failed to remove the node from the k8s cluster on a downscale request when there were more than 10 nodes in the autoscaling nodepool.

After a downscaling request, the node compute01-ccx23-auto-fy7ww3o-10 was removed from Longhorn and its VM was destroyed by the terraformer, but the node wasn't removed from the k8s cluster. The only error log I could find was in the cluster-autoscaler.

I0424 08:09:53.735469       1 actuator.go:161] Scale-down: removing empty node "compute01-ccx23-auto-fy7ww3o-10"
E0424 08:09:58.749463       1 actuator.go:423] Scale-down: couldn't delete empty node, , status error: failed to delete compute01-ccx23-auto-fy7ww3o-10: rpc error: code = Unknown desc = failed to update nodepool compute01-ccx23-auto-fy7ww3o : error while updating the state in the Claudie : rpc error: code = Unknown desc = the project default-wox01 is currently in the build stage

Besides that, there were some odd info logs in kuber. When kuber deletes a node, the logs normally look as follows.

2024-04-24T07:45:23Z INF Deleting node <node-name> from nodes.longhorn.io from cluster cluster=wox01-cluster-qy5w5zl module=kuber
2024-04-24T07:45:24Z INF Deleting node <node-name> from k8s cluster cluster=wox01-cluster-qy5w5zl module=kuber

Below are the logs for the deletion of compute01-ccx23-auto-fy7ww3o-10. Each log line contains a different node name, even though both should name the same node. For compute01-ccx23-auto-fy7ww3o-10 the log message matched the cluster state (the node was indeed deleted from Longhorn), but compute01-ccx23-auto-fy7ww3o-1 was still part of the k8s cluster and had running workloads.

2024-04-24T07:45:23Z INF Deleting node compute01-ccx23-auto-fy7ww3o-10 from nodes.longhorn.io from cluster cluster=wox01-cluster-qy5w5zl module=kuber
2024-04-24T07:45:24Z INF Deleting node compute01-ccx23-auto-fy7ww3o-1 from k8s cluster cluster=wox01-cluster-qy5w5zl module=kuber

There's probably some unexpected behavior when the number suffix of a node in an autoscaling nodepool grows to two digits.
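Purely as an illustration of the suspected mechanism (this is not Claudie code): the name ending in -1 is a strict prefix of the name ending in -10, so any substring- or prefix-based lookup cannot tell the two nodes apart.

package main

import (
	"fmt"
	"strings"
)

func main() {
	// The one-digit name is a strict prefix of the two-digit name,
	// so a substring check matches both nodes.
	fmt.Println(strings.Contains(
		"compute01-ccx23-auto-fy7ww3o-10", // name being looked up
		"compute01-ccx23-auto-fy7ww3o-1",  // candidate from the node list
	)) // true
}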

Expected Behaviour

Claudie should successfully remove the node from the autoscaling nodepool on a downscaling request.

Steps To Reproduce

I haven't tried it myself, but the following steps should work.

  1. Create a k8s cluster with autoscaling nodepool.
  2. Deploy enough workload to have at least 10 nodes in the autoscaling nodepool.
  3. Start downscaling the autoscaling nodepool by deleting the workload. The problem should occur when Claudie tries to remove a node whose number suffix has two digits. It may be that it only occurs when Claudie tries to remove the node with suffix 10 while the node with suffix 1 is still up.

Anything else to note

The culprit might be this substring match:

if realNodeName := utils.FindName(realNodeNames, worker); realNodeName != "" {

// presumably inside utils.FindName — only a substring check, so the
// lookup for "…-10" can match "…-1" first:
if strings.Contains(name, n) {
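If that is the case, an exact comparison would avoid the collision. A minimal sketch of what the lookup could do instead (a hypothetical findName of my own, not a proposed patch, assuming both name lists spell the node names identically):

package main

import "fmt"

// findName matches on the whole node name instead of a substring,
// so looking up "…-10" can no longer return "…-1".
func findName(realNodeNames []string, name string) string {
	for _, n := range realNodeNames {
		if n == name {
			return n
		}
	}
	return ""
}

func main() {
	nodes := []string{
		"compute01-ccx23-auto-fy7ww3o-1",
		"compute01-ccx23-auto-fy7ww3o-10",
	}
	fmt.Println(findName(nodes, "compute01-ccx23-auto-fy7ww3o-10"))
	// prints compute01-ccx23-auto-fy7ww3o-10, not the -1 node
}

If exact equality is too strict because the real names carry extra prefixes or suffixes, a delimiter-aware check (e.g. requiring the match to end at a "-" boundary or at the end of the string) would still fix the two-digit case.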