Deletion and reconciliation of KAS LB breaks cluster
gottwald opened this issue · 1 comments
In DO, deleting a load balancer also releases the IP address assigned to it.
For example, if you delete the KAS LB in the DO console, the controller reconciles it and creates a new one, but the new LB gets a different IP address.
This breaks the cluster because the old IP address is still recorded as the endpoint in the cluster type and is also embedded in the certificates of all components. Even if you update the record in the cluster type, the old IP in the certs still makes all communication between components fail.
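To illustrate the failure mode, here is a minimal sketch of the hostname check a TLS client performs against a certificate's subject alternative names (SANs). It is a simplification (exact matches only, no wildcards), and the function name, the addresses, and `kas.example.com` are all made up for the example; none of it comes from the controller or from kubeadm.

```python
import ipaddress
from typing import List


def _is_ip(value: str) -> bool:
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def endpoint_matches_sans(endpoint: str, sans: List[str]) -> bool:
    """Simplified SAN check: IPs match IP SANs, names match DNS SANs."""
    if _is_ip(endpoint):
        target = ipaddress.ip_address(endpoint)
        return any(_is_ip(s) and ipaddress.ip_address(s) == target for s in sans)
    return endpoint.lower() in {s.lower() for s in sans if not _is_ip(s)}


# Hypothetical values: documentation-range IPs and an example hostname.
OLD_LB_IP = "203.0.113.10"
NEW_LB_IP = "203.0.113.20"
ip_sans = ["10.96.0.1", OLD_LB_IP]            # cert issued for the old LB IP
dns_sans = ["10.96.0.1", "kas.example.com"]   # cert issued for a DNS name

print(endpoint_matches_sans(OLD_LB_IP, ip_sans))           # True
print(endpoint_matches_sans(NEW_LB_IP, ip_sans))           # False: new LB IP is not in the cert
print(endpoint_matches_sans("kas.example.com", dns_sans))  # True, whatever IP the name resolves to
```

A cert issued for a DNS name stays valid when the LB IP changes, which is why a stable endpoint host avoids re-issuing certs.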
Recovery is a pretty rough combination of manually changing the IP in a lot of configs, then visiting every worker, deleting the existing certs, and getting kubeadm to issue new certs with the correct IP in them.
This issue should not arise during normal operations and normal reconciliations through the controller, but it comes up whenever someone accidentally deletes the load balancer (which is not too uncommon).
To mitigate this, I'm proposing an option to use a managed DNS record, kept up to date by the DO controller, as the endpoint host. That way, recovery would be much easier and could even be automatic.
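Roughly, the reconcile step for such a record could look like the sketch below. This is only an illustration of the idea, not the PoC and not the controller's actual code: `FakeDNS`, `reconcile_endpoint_record`, the record name, and the IPs are all hypothetical, and the in-memory class stands in for a real DNS API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FakeDNS:
    """In-memory stand-in for a DNS provider API (hypothetical)."""
    records: Dict[str, str] = field(default_factory=dict)

    def get_a_record(self, name: str) -> Optional[str]:
        return self.records.get(name)

    def upsert_a_record(self, name: str, ip: str) -> None:
        self.records[name] = ip


def reconcile_endpoint_record(dns: FakeDNS, name: str, lb_ip: str) -> bool:
    """Point `name` at the LB's current IP. Return True if the record changed."""
    if dns.get_a_record(name) == lb_ip:
        return False  # already up to date, nothing to do
    dns.upsert_a_record(name, lb_ip)
    return True


dns = FakeDNS()
endpoint = "kas.example.com"  # assumed endpoint host, stays constant

# First reconcile: the record is created for the initial LB IP.
print(reconcile_endpoint_record(dns, endpoint, "203.0.113.10"))  # True
# Steady state: no change.
print(reconcile_endpoint_record(dns, endpoint, "203.0.113.10"))  # False
# LB was deleted and recreated with a new IP: the record follows it.
print(reconcile_endpoint_record(dns, endpoint, "203.0.113.20"))  # True
print(dns.get_a_record(endpoint))  # 203.0.113.20
```

The endpoint host never changes, so neither the endpoint recorded in the cluster type nor the certs would need to be touched after the LB is recreated.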
I have a local PoC of this running that I could submit as a PR.
WDYT?
Hi @gottwald
That's reasonable.
Yeah, that can happen: someone accidentally deletes a resource. We need to anticipate things like that.
I just took a quick look at your PR, and it's good that using the DNS record as the control plane endpoint is optional, so it won't be a breaking change.
I'll take another look at your PR later. Thanks!!