k0sproject/k0sctl

k0sctl is not handling node removal correctly

danielskowronski opened this issue · 10 comments

Versions

  • k0s: v1.28.3+k0s.0
  • k0sctl: v0.16.0 (7e8c272)

Summary

k0sctl silently does not support the situation where a controller entry disappears from spec.hosts in k0sctl.yaml. Additionally, k0sctl treats k0sctl.yaml as the source of truth and does not check every controller. In the worst case this leads to split-brain: 2 clusters exist despite HA being mandatory for the control plane. It is especially visible when workflows like Terraform expect controller removal to be a single run.

Details

All scenarios assume that the starting point is a set of 3 VMs for controllers (c01, c02, c03) and 4 VMs for workers (w01, ...) with static IPs assigned. All VMs are fresh EC2 instances started from the latest Ubuntu AMIs before every scenario begins. Static IPs ensure that etcd problems are immediately visible. The first controller is always targeted so that any issues with leadership surface immediately.

Cluster health and membership can be verified by running k0s etcd member-list on each VM, or by checking whether each controller can use k0s kubectl to obtain information about the cluster.
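For reference, a rough sketch of the kind of k0sctl.yaml spec assumed in these scenarios (addresses, user and key path are placeholders, not the actual file from the attached zip):

    apiVersion: k0sctl.k0sproject.io/v1beta1
    kind: Cluster
    metadata:
      name: k0s-cluster
    spec:
      k0s:
        version: v1.28.3+k0s.0
      hosts:
        - role: controller
          ssh:
            address: 10.0.0.11   # c01 (static IP)
            user: ubuntu
            keyPath: ~/.ssh/id_rsa
        - role: controller
          ssh:
            address: 10.0.0.12   # c02
            user: ubuntu
            keyPath: ~/.ssh/id_rsa
        - role: controller
          ssh:
            address: 10.0.0.13   # c03
            user: ubuntu
            keyPath: ~/.ssh/id_rsa
        - role: worker
          ssh:
            address: 10.0.0.21   # w01 (w02..w04 analogous)
            user: ubuntu
            keyPath: ~/.ssh/id_rsa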

Scenario 1: procedure to leave ETCD is followed -> works

A cluster is created using k0sctl apply and verified to be working. If you follow the procedure to k0s etcd leave targeting the leader and execute k0sctl apply right away, it'll fail, but the cluster is left intact.

After that, if you complete the full controller removal procedure by executing k0s stop; k0s reset; reboot on the leader and then run k0sctl apply, it works as expected and adds c01 back to the cluster. All operations are verified to be OK.

Scenario 2: controller is removed from k0sctl.yaml without any additional operations -> fails, but not catastrophically

A cluster is created using k0sctl apply and verified to be working. If you remove leader c01 from the YAML file and run k0sctl apply, it seems like it was removed from the cluster (e.g. the log says "2 controllers in total").

However, all 3 controllers remain in the cluster and there is no outage.

If you re-add the c01 entry to the hosts list in the same form as before and run k0sctl apply, it is "added back" to the cluster (the log says "3 controllers in total"). Nothing has actually changed: the cluster still has 3 controllers and works fine.
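To make the manipulation in this scenario concrete, "removing c01" simply means deleting its entry from spec.hosts and leaving everything else in place, roughly (same placeholder addresses as in the sketch above):

    spec:
      hosts:
        # the c01 controller entry (address: 10.0.0.11) is deleted from the file
        - role: controller
          ssh:
            address: 10.0.0.12   # c02
        - role: controller
          ssh:
            address: 10.0.0.13   # c03
        # worker entries unchanged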

Scenario 3: controller is externally wiped and k0sctl runs on unchanged file -> breaks cluster

A cluster is created using k0sctl apply and verified to be working. The leader VM is destroyed by external means; the user may not even be aware of it. A fresh VM now exists with the same IP address (the hostname may be different). k0sctl apply is executed on a YAML file that either wasn't changed, or was changed in a way that leaves the ssh section the same (i.e. hostname or environment could have changed, for example when Terraform rebuilds the EC2 instance). The effect is the following:

  • c01, which is the VM that was destroyed previously, is still considered the leader by k0sctl; since it is empty, k0sctl installs a new cluster there and attempts to join c02+c03, which fails; the etcd installed there only recognizes c01 as a member; kubectl shows no workers
  • c02 and c03 still form a cluster that has the old c01 membership registered; this is easy to diagnose if you set a globally unique hostname when the OS is installed (e.g. the EC2 resource ID or the VM creation date); the cluster on c02+c03 can't talk to c01 because the etcd clusters are different; kubectl shows all workers
  • for the whole setup, which follows https://docs.k0sproject.io/v1.28.4+k0s.0/high-availability/, there is now a load balancer pointing at 3 VMs: c01, c02 and c03, all "healthy", so traffic directed at the k8s control plane (both end-user traffic and traffic originating from k0s controllers) randomly hits either the fresh cluster with c01 or the old degraded cluster with c02 and c03 -> this is effectively split-brain

This seems to be solved in the current main branch (39674d59b2f9546f83c74127dd64fb9dd553fad5), but that only lowers the severity to "fails, but not catastrophically". Re-runs of the command do not trigger the recently added etcd leave.

Actual problems in one list

  1. k0sctl does not handle a controller being removed from the spec file
    • it only works if you set the "Reset" flag, but that does not match the "apply" command description and is completely incompatible with stateful systems like Terraform
  2. k0sctl does not handle changes from the outside world
    • in other words, it only works if it is the only thing that can manipulate any resource related to the k0s cluster
    • you have to manually detect drift and apply changes (e.g. k0s etcd leave) so that the real world matches k0sctl.yaml before running "apply"
    • the new unreleased version only stops the cluster from crashing but does not solve the problem of the missing node-replacement capability
  3. k0sctl blindly assumes that whatever the spec says is the leader is always the leader
    • what is missing is the ability to ask all controllers what they think the cluster state is
    • with v0.16.0, only the host that used to be leader when k0sctl was last run is validated
  4. k0s/k0sctl rely solely on the IP address to form the etcd cluster
    • maybe it should use Metadata.MachineID
    • it seems like if you set ssh.address to a hostname (which you can make globally unique), then etcd cannot start

Additionally, it seems like setting the reset flag, running apply, and then removing the controller from the YAML followed by another apply does not trigger a working node removal from etcd and the ControlNode object.
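For clarity, "setting the reset flag" here refers to the per-host reset field in the spec, roughly like this (a sketch with the same placeholder address as above; only the relevant host is shown):

    spec:
      hosts:
        - role: controller
          reset: true            # mark c01 to be reset/removed on the next apply
          ssh:
            address: 10.0.0.11   # c01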

The attached zip covers 3 phases: bootstrap, reset leader, and remove leader, each with the input k0sctl.yaml, logs from k0sctl apply and kubectl get ControlNode -o yaml, plus the final state of the etcd memberships.

k0sctl_603_reset.zip

kke commented

k0sctl blindly assumes that what spec says is a leader is always a leader

It just goes through all controllers in the config and picks the first one that has k0s running and isn't marked to be reset. If none can be found, the first controller is used as the "leader". There shouldn't be any special treatment for the leader; it's just a "randomly" picked controller that is used for running commands that need to run on a controller.

		// Pick the first controller that reports to be running and persist the choice
		for _, h := range controllers {
			if !h.Reset && h.Metadata.K0sBinaryVersion != nil && h.Metadata.K0sRunningVersion != nil {
				s.k0sLeader = h
				break
			}
		}

		// Still nil?  Fall back to first "controller" host, do not persist selection.
		if s.k0sLeader == nil {
			return controllers.First()
		}

with v0.16.0, only the host that used to be leader when k0sctl was last run is validated

Hmm, validated how?

adding reset flag, running apply and then removing the controller from YAML followed by apply does not trigger working node removal from ETCD and ControlNode object

The ControlNode objects are autopilot's, so it seems deleting a Kubernetes node does not trigger a removal from autopilot; I don't know how autopilot manages removed nodes.

I think k0sctl should maybe do etcd leave before/after kubectl delete node, or maybe k0s reset should do that on its own?


This is btw automatically done when needed:

      environment:
        ETCD_UNSUPPORTED_ARCH: arm

Your arch seems to be arm64; 64-bit arm is supported on etcd 3.5.0+, which is included in k0s v1.22.1+k0s.0 and newer.
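For reference, if that environment block ever had to be set by hand, it would sit under a host entry in k0sctl.yaml, roughly (placeholder address):

    spec:
      hosts:
        - role: controller
          ssh:
            address: 10.0.0.11
          environment:
            ETCD_UNSUPPORTED_ARCH: arm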