Azure/AKS-Construction

CI/CD Deployment testing failure

Gordonby opened this issue · 5 comments

Describe the bug
During full deployment tests in the CI/CD pipeline, we get an error because of the state of the environment we're deploying to.

To Reproduce

  1. Tag a PR with either;
    test-deploy-byoconfig
    test-deploy-privateconfig
  2. Wait for checks to run and fail.

Expected behavior
Environment considerations are properly reset so deployment tests can run.

Additional context

ERROR: {"status":"Failed","error":{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. 
Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.",
"details":[{"code":"BadRequest","message":"{\r\n  \"error\": {\r\n    
\"code\": \"RoleAssignmentUpdateNotPermitted\",\r\n    
\"message\": \"Tenant ID, application ID, principal ID, and scope are not allowed to be updated.\"\r\n  }\r\n}"}]}}

The role assignment thats having the problem is the RG Reader role for AppGw;

// AGIC's identity requires "Reader" permission over Application Gateway's resource group.
var reader = subscriptionResourceId('Microsoft.Authorization/roleDefinitions', 'acdd72a7-3385-48ef-bd42-f606fba81ae7')
resource appGwAGICRGReader 'Microsoft.Authorization/roleAssignments@2022-04-01' = if (ingressApplicationGateway && deployAppGw) {
  scope: resourceGroup()
  name: guid(aks.id, 'Agic', reader)
  properties: {
    roleDefinitionId: reader
    principalType: 'ServicePrincipal'
    principalId: aks.properties.addonProfiles.ingressApplicationGateway.identity.objectId
  }
}

The reason it's having a problem is because the name isn't unique. It's using a static 'Agic' string instead of an identifier to the identity such as principalId. This is because the identity is not known before main.bicep is launched, therefore it cannot form part of the name.

I see 3 options for resolution;

  1. Refactor the AppGw out of main.bicep, which will allow the role assignment name guid to be based on the already existing AKS AppGW ObjectId.
  2. Debug why the environment cleanup is not removing this role assignment (it should). Implement fix.
  3. Speak to the AKS team about addOn support for existing managed identities

Raised option 3 in the AKS repo.

Azure/AKS#3863

samaea commented

Thanks for reporting @Gordonby. + @khowling what is the impact on the above?

Thanks for reporting @Gordonby. + @khowling what is the impact on the above?

The impact is that ;

  1. 2 of the scheduled CI jobs will always fail.
  2. Full testing in PR will always fail
  3. My Pr is blocked #647

image

samaea commented

@mosabami to review next steps.

@pjlewisuk will be discussing this with @Gordonby and @samaea