Azure/terraform-azurerm-caf-enterprise-scale

Changing root_parent_id results in Management Groups not being deployed

lpmulligan opened this issue ยท 10 comments

Community Note

  • Please vote on this issue by adding a ๐Ÿ‘ reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Versions

terraform: 1.0.8

azure provider: 2.80.0

module: 1.0.0

Description

I'm trying to deploy an Enterprise Scale LZ deployment within an existing EA structure. I've been assigned owner to a Management Group 3 levels down from the Tenant Root Group. I would like to deploy a stock ES LZ within that management group. Changing root_parent_id from:

root_parent_id = data.azurerm_client_config.core.tenant_id
root_id        = "myorg"
root_name      = "My Organization"

deploy_demo_landing_zones = true

to

root_parent_id = "ELZDemo"
root_id        = "myorg"
root_name      = "My Organization"

deploy_demo_landing_zones = true

Results in all the
resource "azurerm_management_group" "level_1", resource "azurerm_management_group" "level_2", etc resources being skipped.

Manually creating the Management Groups under "ELZDemo" by hand allows the core management terraform apply to run successfully.

Steps to Reproduce

  1. Set root_parent_id = data.azurerm_client_config.core.tenant_id in main.tf
  2. Run terraform plan
  3. You get MG groups created with 191 resources
  4. Set root_parent_id = "ELZDemo"
  5. You get no MG groups with 180 resources.

Screenshots

N/A

Additional context

N/A

Hey @lpmulligan,

Thanks for raising an issue ๐Ÿ‘

I have just tested this myself and it worked okay for me:
image

As you can see highlighted in orange.

My TF file looked like this:

# We strongly recommend using the required_providers block to set the
# Azure Provider source and version being used.

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = ">= 2.77.0"
    }
  }
}

provider "azurerm" {
  features {}
}

# You can use the azurerm_client_config data resource to dynamically
# extract connection settings from the provider configuration.

data "azurerm_client_config" "core" {}

# Call the caf-enterprise-scale module directly from the Terraform Registry
# pinning to the latest version

module "enterprise_scale" {
  source  = "Azure/caf-enterprise-scale/azurerm"
  version = "1.0.0"

  providers = {
    azurerm              = azurerm
    azurerm.connectivity = azurerm
    azurerm.management   = azurerm
  }

  root_parent_id = "jt-102021-landingzones"
  root_id        = "190"
  root_name      = "GH Issue 190 Test"

  deploy_demo_landing_zones = true
}

Are you able to share your TF file/s and a simple diagram of your Management Group hierarchy? (please redact any sensitive information)

Look forward to hearing from you

Thanks

Jack

@jtracey93. Thanks for double-checking this for me.

I did try this deployment in two separate Azure tenants to make sure I didn't miss something. The second tenant being my MSDN one. I got the same results; hence, I opened up this issue.

A few things:

  1. I reran things again in my MSDN subscription and things did work as expected. I think I may have forgotten
    to add deploy_demo_landing_zones = true during my testing. Sorry about. Adding deploy_demo_landing_zones = true may not be the reason it worked. See #4 below.

  2. In my EA subscription, things continue to not work.

  • Here's a screenshot of my MG layout.
    MG Groups

You'll notice that I don't have access to all the MGs in the tree like you do...nor like I do in my MSDN sub!

  • My code is similar to yours. Here's the module block
module "enterprise_scale" {
  source  = "Azure/caf-enterprise-scale/azurerm"
  version = "1.0.0"

  providers = {
    azurerm              = azurerm
    azurerm.connectivity = azurerm
    azurerm.management   = azurerm
  }

  root_parent_id = "ELZDemo"
  root_id        = "eslz"
  root_name      = "Enterprise-Scale LZ"

  deploy_demo_landing_zones = true
}
  1. Some interesting developments
  • Setting my module to this...exactly like yours
module "enterprise_scale" {
  source  = "Azure/caf-enterprise-scale/azurerm"
  version = "1.0.0"

  providers = {
    azurerm              = azurerm
    azurerm.connectivity = azurerm
    azurerm.management   = azurerm
  }

  root_parent_id = "jt-102021-landingzones"
  root_id        = "190"
  root_name      = "GH Issue 190 Test"

  deploy_demo_landing_zones = true
}

Results in the TF plan saying it's going to create the MG groups as expected even though the parent MG "jt-102021-landingzones" doesn't exist.

However, if I change root_parent_id = "jt-102021-landingzones" to root_parent_id = "ELZDemo" the TF plan skips the creation of the MGs.

  1. Sorry for waiting until the end, but I think I've narrowed down the problem.
  • If I set root_parent_id = "ELZDemo", it doesn't work. However, if I set root_parent_id = "eslz-demo" it does. So I'm thinking there's something up with how the module identifies the root_parent_id MG and there's no hyphen in there.

Sincerely,

Larry

Hey @lpmulligan,

Thanks for the excellent write up, really useful. Let me review and investigate further and we'll come back to you ๐Ÿ‘

Cheers

Jack

Quick update have tested the hyphen theory out further:
image

Gives me the 191 resources:
image

Code:

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = ">= 2.77.0"
    }
  }
}

provider "azurerm" {
  features {}
}

module "enterprise_scale" {
  source  = "Azure/caf-enterprise-scale/azurerm"
  version = "1.0.0"

  providers = {
    azurerm              = azurerm
    azurerm.connectivity = azurerm
    azurerm.management   = azurerm
  }

  root_parent_id = "tf190level3"
  root_id        = "eslz"
  root_name      = "Enterprise-Scale LZ"

  deploy_demo_landing_zones = true
}

Also just tried the same hierarchy above with an SPN that only has RBAC as below:

  • Reader on a subscription (used in provider block) outside of "tf190" hierarchy
  • Owner on "tf190level3"

Still got a correct plan of 191 resources

Deployed with above SPN:
image

Hey @lpmulligan,

So from digging through the module a little deeper tonight we think we may of found the reason why this is happening.

These 2 lines in locals.management_group.tf seem to be of importance for this scenario and its actually looking like its due to the casing used for the input of root_parent_id:

On the line, from the first link, we are building all of the maps for the Management Groups to be created by the module, but we are evaluating if the parent_management_group_id (that we create above in the same file, ready for deployment later) is the same (case sensitive as its Terraform) as root_parent_id.

But in the second link we have some logic that says if the parent_management_group_id is longer than 0 characters in length, which it is, then lowercase the string and replace any non-alphanumeric characters with -'s.

So when combined your root_parent_id of ELZDemo turns into elzdemo and when they are compared, remembering they are case sensitive, they don't match. Which then causes the if statement (first link) to kick in and make the map empty for level_1, which then has a cascading effect down through the other levels.

So what are we going to do...?

Good question, we need to do some further testing of some potential fixes and we also need to just check "why" we implemented the logic to start with an we will then make a PR to fix.

Give us a few days and we will come back to you ๐Ÿ‘

Hope that all makes sense, and feel free to reach out with any further questions or queries

Thanks

Jack

(also thanks to @krowlandson for the late night discussion to review my thinking ๐Ÿ‘)

Well shoot. I thought I had it. I think it's something in my environment not related to the code. I just had a co-worker change the MG ID from ELZDemo to eslz-demo and it works. However, your test shows that my hypothesis was wrong which is fine. Something's up with my setup...I don't know where but feel free to close this issue. Thanks for your assistance.

You certainly led us down the right path, so thank you ๐Ÿ‘

We will leave this open so we can track the bug fix ๐Ÿ‘

If you have another issue, feel free to open a another issue on the repo. Happy to help ๐Ÿ‘

Well shoot. I thought I had it. I think it's something in my environment not related to the code. I just had a co-worker change the MG ID from ELZDemo to eslz-demo and it works. However, your test shows that my hypothesis was wrong which is fine. Something's up with my setup...I don't know where but feel free to close this issue. Thanks for your assistance.

@lpmulligan - just to add that it may have worked if you just set root_parent_id = "elzdemo" (i.e. all lowercase).

As per @jtracey93's follow up, we have found the point where this is going wrong for you in the code and will work on getting this resolved in a future release. We first need to work out the impact of the fix though, as we may cause unexpected changes if someone has inadvertently used the incorrect casing somewhere within their config and the following is automatically "correcting" this:

parent_management_group_id = try(length(value.parent_management_group_id) > 0, false) ? replace(lower(value.parent_management_group_id), "/[^a-z0-9]/", "-") : local.root_parent_id

We need to be careful to understand whether changing this will cause deployments with incorrect config to fail, or to redeploy Management Groups.