Entries for ProviderServiceAccount created in the ConfigMap as part of cluster creation are not cleaned up on cluster deletion

sidharthsurana opened this issue 9 months ago · 1 comments

/kind bug

What steps did you take and what happened:
When a cluster is created, a couple of ProviderServiceAccount CRs are created and processed. Part of that processing adds entries to the ConfigMap identified by the SERVICE_ACCOUNTS_CM_NAMESPACE and SERVICE_ACCOUNTS_CM_NAME parameters.

When the cluster is deleted, the entries added to that ConfigMap should be cleaned up. The bug is that, although the code for this cleanup exists, it is never called, because we do not add an explicit finalizer on the VSphereCluster object for ProviderServiceAccount lifecycle management (LCM). As a result, the VSphereCluster CR is deleted before the serviceaccount_controller gets a chance to clean up the entries.

The following log snippet is from the CAPV pod, for a cluster tkc-01 that was deleted:

I0319 22:12:47.292204       1 serviceaccount_controller.go:204] "capv-controller-manager/providerserviceaccount-controller/ns01/tkc-01-cpxsx: The control plane is not ready yet" err="failed to create client for Cluster ns01/tkc-01: Get \"https://192.168.124.1:6443/api?timeout=10s\": context deadline exceeded"
I0319 22:12:47.293146       1 serviceaccount_controller.go:145] "capv-controller-manager/providerserviceaccount-controller: vSphereCluster not found, won't reconcile" cluster="ns01/tkc-01-cpxsx-ccm"
I0319 22:12:47.293171       1 serviceaccount_controller.go:145] "capv-controller-manager/providerserviceaccount-controller: vSphereCluster not found, won't reconcile" cluster="ns01/tkc-01-cpxsx-pvcsi"
I0319 22:12:47.293314       1 serviceaccount_controller.go:145] "capv-controller-manager/providerserviceaccount-controller: vSphereCluster not found, won't reconcile" cluster="ns01/tkc-01-cpxsx"
I0319 22:14:47.293831       1 serviceaccount_controller.go:145] "capv-controller-manager/providerserviceaccount-controller: vSphereCluster not found, won't reconcile" cluster="ns01/tkc-01-cpxsx"

As seen above, the VSphereCluster is removed before the controller can catch up and clean up the entries.

This causes the ConfigMap to grow indefinitely and eventually hit the etcd storage limit for an individual key/value, after which creating more clusters fails.

What did you expect to happen:
After a cluster is deleted, the corresponding entries in the provider service account ConfigMap should also be removed.

Anything else you would like to add:
Proposed solution:
We should add an explicit finalizer when creating the VSphereCluster to track the lifecycle management of its ProviderServiceAccounts. serviceaccount_controller.go should then remove this finalizer once it has cleaned up the entries for the cluster.
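
To make the proposed flow concrete, here is a minimal, self-contained sketch of the finalizer pattern. It is not the actual CAPV code: the `vSphereCluster` struct, the `configMapData` map, the finalizer name, the `-ccm`/`-pvcsi` key format, and the `reconcile` function are all hypothetical stand-ins that use only the Go standard library. A real implementation would presumably go through the controller-runtime client and the real API types.

```go
package main

import "fmt"

// Hypothetical finalizer name; the real name would be chosen by the project.
const providerServiceAccountFinalizer = "example.com/providerserviceaccount-cleanup"

// vSphereCluster is a simplified stand-in for the real VSphereCluster CR.
type vSphereCluster struct {
	Namespace, Name string
	Finalizers      []string
	Deleting        bool // stands in for a non-zero deletionTimestamp
}

// configMapData is a stand-in for the .data of the service accounts ConfigMap.
type configMapData map[string]string

func hasFinalizer(c *vSphereCluster, f string) bool {
	for _, x := range c.Finalizers {
		if x == f {
			return true
		}
	}
	return false
}

func addFinalizer(c *vSphereCluster, f string) {
	if !hasFinalizer(c, f) {
		c.Finalizers = append(c.Finalizers, f)
	}
}

func removeFinalizer(c *vSphereCluster, f string) {
	kept := c.Finalizers[:0]
	for _, x := range c.Finalizers {
		if x != f {
			kept = append(kept, x)
		}
	}
	c.Finalizers = kept
}

// entryKeys returns the ConfigMap keys owned by a cluster.
// The key format is an assumption made for this sketch only.
func entryKeys(c *vSphereCluster) []string {
	base := c.Namespace + "/" + c.Name
	return []string{base + "-ccm", base + "-pvcsi"}
}

// reconcile reports whether the object has no finalizers left and is being
// deleted, i.e. whether the API server would be free to remove it.
func reconcile(c *vSphereCluster, cm configMapData) bool {
	if !c.Deleting {
		// Normal path: take the finalizer first, then record the entries.
		addFinalizer(c, providerServiceAccountFinalizer)
		for _, k := range entryKeys(c) {
			cm[k] = "registered"
		}
		return false
	}
	// Delete path: clean up the entries, and only then release the finalizer.
	for _, k := range entryKeys(c) {
		delete(cm, k)
	}
	removeFinalizer(c, providerServiceAccountFinalizer)
	return len(c.Finalizers) == 0
}

func main() {
	cm := configMapData{}
	c := &vSphereCluster{Namespace: "ns01", Name: "tkc-01"}

	reconcile(c, cm)
	fmt.Println("entries after create:", len(cm), "finalizers:", c.Finalizers)

	c.Deleting = true
	removable := reconcile(c, cm)
	fmt.Println("entries after delete:", len(cm), "removable:", removable)
}
```

The point of the ordering is that the object cannot disappear while the finalizer is present, so the controller is guaranteed to see the delete and run the cleanup before the VSphereCluster is gone.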

Environment:

Cluster-api-provider-vsphere version: Issue present in all the versions including main
Kubernetes version: (use kubectl version):
OS (e.g. from /etc/os-release):

fabriziopandini commented 9 months ago

/assign