Azure/azure-service-operator

Bug: FleetsMember is unable to handle migration to another Fleet Manager.

daftping opened this issue · 1 comments

Version of Azure Service Operator

mcr.microsoft.com/k8s/azureserviceoperator:v2.8.0

Describe the bug
FleetsMember is unable to handle migration to another Fleet Manager.
When the FleetsMember.spec.owner.armId field is updated, the resource gets stuck in the ClusterAlreadyJoinedAnotherFleet state.

To Reproduce
  1. Create a FleetsMember resource that joins a Fleet Manager.
  2. Update the spec.owner.armId field to point to another Fleet Manager.

Expected behavior
The resource should handle cluster migration to another Fleet Manager (unjoin the old Fleet Manager, then join the new one). If that is not possible, the relevant fields should be immutable.

kubectl describe fleetsmember.containerservice.azure.com/cluster-name-b
Name:         cluster-name-b
Namespace:    default
Labels:       <none>
Annotations:  serviceoperator.azure.com/credential-from: aso-credentials
              serviceoperator.azure.com/latest-reconciled-generation: 4
              serviceoperator.azure.com/operator-namespace: capz-system
              serviceoperator.azure.com/resource-id:
                /subscriptions/<reducted>/resourceGroups/alex-fleet-a/providers/Microsoft.containerservice/fleets/alex-a/members...
API Version:  containerservice.azure.com/v1api20230315preview
Kind:         FleetsMember
Metadata:
  Creation Timestamp:  2024-08-02T19:51:08Z
  Finalizers:
    serviceoperator.azure.com/finalizer
  Generation:  4
  Owner References:
    API Version:           infrastructure.cluster.x-k8s.io/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  AzureASOManagedControlPlane
    Name:                  cluster-name-b
    UID:                   99941c17-8884-4106-84de-02e22b72bfbf
  Resource Version:        4235112
  UID:                     d8b72164-d059-488a-a4d8-719ff32130a1
Spec:
  Azure Name:  cluster-name-b
  Cluster Resource Reference:
    Group:  containerservice.azure.com
    Kind:   ManagedCluster
    Name:   cluster-name-b
  Group:    default
  Owner:
    Arm Id:  /subscriptions/<reducted>/resourceGroups/alex-fleet-b/providers/Microsoft.ContainerService/fleets/alex-fleet-b
Status:
  Cluster Resource Id:  /subscriptions/<reducted>/resourceGroups/cluster-name-b/providers/Microsoft.ContainerService/managedClusters/cluster-name-b
  Conditions:
    Last Transition Time:  2024-08-06T18:26:32Z
    Message:               One cluster can only join one fleet. The given cluster /subscriptions/<reducted>/resourceGroups/cluster-name-b/providers/Microsoft.ContainerService/managedClusters/cluster-name-b already joined another fleet /subscriptions/<reducted>/resourceGroups/alex-fleet-a/providers/Microsoft.ContainerService/fleets/alex-a. If you want to move the cluster to this fleet, leave the other fleet and try again. Resource ID: "/subscriptions/<reducted>/resourceGroups/alex-fleet-b/providers/Microsoft.ContainerService/fleets/alex-fleet-b/members/cluster-name-b". Correlation ID: "8800d607-34e8-4d4b-b5e5-2db5a2aa8566". Operation ID: "417decdf-fa59-4e04-b701-f1c87e398998": PUT https://management.azure.com/subscriptions/<reducted>/resourceGroups/alex-fleet-b/providers/Microsoft.ContainerService/fleets/alex-fleet-b/members/cluster-name-b
--------------------------------------------------------------------------------
RESPONSE 400: 400 Bad Request
ERROR CODE: ClusterAlreadyJoinedAnotherFleet
--------------------------------------------------------------------------------
{
  "error": {
    "code": "ClusterAlreadyJoinedAnotherFleet",
    "message": "One cluster can only join one fleet. The given cluster /subscriptions/<reducted>/resourceGroups/cluster-name-b/providers/Microsoft.ContainerService/managedClusters/cluster-name-b already joined another fleet /subscriptions/<reducted>/resourceGroups/alex-fleet-a/providers/Microsoft.ContainerService/fleets/alex-a. If you want to move the cluster to this fleet, leave the other fleet and try again. Resource ID: \"/subscriptions/<reducted>/resourceGroups/alex-fleet-b/providers/Microsoft.ContainerService/fleets/alex-fleet-b/members/cluster-name-b\". Correlation ID: \"8800d607-34e8-4d4b-b5e5-2db5a2aa8566\". Operation ID: \"417decdf-fa59-4e04-b701-f1c87e398998\""
  }
}
--------------------------------------------------------------------------------

    Observed Generation:  4
    Reason:               ClusterAlreadyJoinedAnotherFleet
    Severity:             Error
    Status:               False
    Type:                 Ready
  E Tag:                  "0200afc4-0000-4d00-0000-66b2689b0000"
  Group:                  default
  Id:                     /subscriptions/<reducted>/resourceGroups/alex-fleet-a/providers/Microsoft.ContainerService/fleets/alex-a/members/cluster-name-b
  Name:                   cluster-name-b
  Provisioning State:     Succeeded
  System Data:
    Created At:             2024-08-06T18:16:52.2228076Z
    Created By:             98828e3b-4b1e-4fc7-850e-faf7f197dc15
    Created By Type:        Application
    Last Modified At:       2024-08-06T18:16:52.2228076Z
    Last Modified By:       98828e3b-4b1e-4fc7-850e-faf7f197dc15
    Last Modified By Type:  Application
  Type:                     Microsoft.ContainerService/fleets/members
Events:
  Type     Reason                     Age                    From                    Message
  ----     ------                     ----                   ----                    -------
  Normal   BeginCreateOrUpdate        13m (x4 over 178m)     FleetsMemberController  Successfully sent resource to Azure with ID "/subscriptions/<reducted>/resourceGroups/alex-fleet-a/providers/Microsoft.containerservice/fleets/alex-a/members/cluster-name-b"
  Normal   CredentialFrom             3m35s (x17 over 178m)  FleetsMemberController  Using credential from "default/aso-credentials"
  Warning  CreateOrUpdateActionError  3m33s                  FleetsMemberController  Reason: ClusterAlreadyJoinedAnotherFleet, Severity: Error, RetryClassification: RetrySlow, Cause: One cluster can only join one fleet. The given cluster /subscriptions/<reducted>/resourceGroups/cluster-name-b/providers/Microsoft.ContainerService/managedClusters/cluster-name-b already joined another fleet /subscriptions/<reducted>/resourceGroups/alex-fleet-a/providers/Microsoft.ContainerService/fleets/alex-a. If you want to move the cluster to this fleet, leave the other fleet and try again. Resource ID: "/subscriptions/<reducted>/resourceGroups/alex-fleet-b/providers/Microsoft.ContainerService/fleets/alex-fleet-b/members/cluster-name-b". Correlation ID: "8800d607-34e8-4d4b-b5e5-2db5a2aa8566". Operation ID: "417decdf-fa59-4e04-b701-f1c87e398998": PUT https://management.azure.com/subscriptions/<reducted>/resourceGroups/alex-fleet-b/providers/Microsoft.ContainerService/fleets/alex-fleet-b/members/cluster-name-b
--------------------------------------------------------------------------------
RESPONSE 400: 400 Bad Request
ERROR CODE: ClusterAlreadyJoinedAnotherFleet
--------------------------------------------------------------------------------
{
  "error": {
    "code": "ClusterAlreadyJoinedAnotherFleet",
    "message": "One cluster can only join one fleet. The given cluster /subscriptions/<reducted>/resourceGroups/cluster-name-b/providers/Microsoft.ContainerService/managedClusters/cluster-name-b already joined another fleet /subscriptions/<reducted>/resourceGroups/alex-fleet-a/providers/Microsoft.ContainerService/fleets/alex-a. If you want to move the cluster to this fleet, leave the other fleet and try again. Resource ID: \"/subscriptions/<reducted>/resourceGroups/alex-fleet-b/providers/Microsoft.ContainerService/fleets/alex-fleet-b/members/cluster-name-b\". Correlation ID: \"8800d607-34e8-4d4b-b5e5-2db5a2aa8566\". Operation ID: \"417decdf-fa59-4e04-b701-f1c87e398998\""
  }
}

I believe another user brought this same problem up in another context in Slack a few days ago.
We should block mutation of owner.armId once it's set, but we don't. Agreed, it's a bug.

ASO currently doesn't support migrating resources between owners (indeed, there are many places where Azure doesn't support such a thing).

Mutating armId currently does the following:

  1. Leaves the old Microsoft.ContainerService/fleets/members resource in place at path /subscriptions/<reducted>/resourceGroups/alex-fleet-a/providers/Microsoft.ContainerService/fleets/alex-a/members/cluster-name-b.
  2. Creates a new Microsoft.ContainerService/fleets/members resource at path /subscriptions/<reducted>/resourceGroups/alex-fleet-b/providers/Microsoft.ContainerService/fleets/alex-fleet-b/members/cluster-name-b.

Fleet then interprets this as "you are trying to join the same cluster to two fleets", because there are two fleets/members resources for the same cluster. ASO doesn't auto-delete the old resource because (in the case of things like databases) that would be very bad. Deletion should be an explicit action by the user.
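
To make the mechanics concrete, here is a rough sketch of why changing the owner produces a second resource rather than moving the first. It assumes the member's ARM ID is simply the owner's ARM ID plus /members/<azureName>, which matches the paths in the output above. memberARMID is a hypothetical helper and `<sub>` is a placeholder, neither is ASO's actual code.

```go
package main

import "fmt"

// memberARMID is a hypothetical helper: it derives the ARM ID of a
// fleets/members child resource from its owner's ARM ID and Azure name.
func memberARMID(ownerARMID, azureName string) string {
	return fmt.Sprintf("%s/members/%s", ownerARMID, azureName)
}

func main() {
	// "<sub>" is a placeholder for the subscription ID.
	oldOwner := "/subscriptions/<sub>/resourceGroups/alex-fleet-a/providers/Microsoft.ContainerService/fleets/alex-a"
	newOwner := "/subscriptions/<sub>/resourceGroups/alex-fleet-b/providers/Microsoft.ContainerService/fleets/alex-fleet-b"

	oldID := memberARMID(oldOwner, "cluster-name-b")
	newID := memberARMID(newOwner, "cluster-name-b")

	// The two IDs differ, so a PUT to newID targets a different resource
	// and leaves the one at oldID untouched.
	fmt.Println("old member:", oldID)
	fmt.Println("new member:", newID)
	fmt.Println("same resource:", oldID == newID)
}
```

Because the PUT goes to the new ID, nothing ever issues a DELETE against the old one, and Fleet sees both members pointing at the same cluster.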

If we fix this bug and make owner.armId immutable, the expected experience to perform this action would become:

  1. Delete the old FleetsMember.
  2. Create a new FleetsMember that points at the new fleet.

Thanks for the report, we'll try to get a fix in for the next release.