Bottlerocket GPU deployment issue after updating EKS module from 19.21.0 to 20.31.6

Question

Opened this issue 3 days ago · 1 comments

Description

I am trying to upgrade from 19.21.0 to 20.31.6. With version 19.21.0 I was able to deploy the managed node groups below with Bottlerocket AMIs, and both the generic CPU nodes and the GPU nodes joined the cluster. After moving to version 20, the generic CPU nodes still join the cluster fine, but the GPU nodes never join, even though I am using the same user data block as in version 19.21.0. I am also unable to connect to the GPU nodes via SSM to troubleshoot further, even though they have the same IAM role attached as the CPU nodes.

My EKS version is 1.31 and the Bottlerocket AMI release version is 1.29.0-c55d099c.

Here is the first part of the module call:

module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "20.31.6"

# version = "19.21.0"

cluster_name = local.cluster_name
cluster_version = local.env.cluster.cluster_version
cluster_endpoint_public_access = true
cluster_timeouts = {
create = "2h" # Timeout for creating the EKS cluster
update = "2h" # Timeout for updating the EKS cluster
delete = "2h" # Timeout for deleting the EKS cluster
}

authentication_mode = "API_AND_CONFIG_MAP"
enable_cluster_creator_admin_permissions = true

Jumping down to the managed groups:

eks_managed_node_groups = {

# bigbang Generic EKS Managed Node Groups
bigbang_generic = {
  name            = "${local.env.name}-agent-${local.random_name_suffix}"
  use_name_prefix = true
  subnet_ids      = local.env.vpc.private_subnet_ids
  # subnet_ids = local.env.vpc.public_subnet_ids

  instance_types = [local.env.cluster.agent.type]
  # Change the below to deploy either Amazon EKS optimized AMI or Bottlerocket AMI - reference the data.tf file for values
  # ami_id = data.aws_ami.eks_default.image_id
  ami_type = "BOTTLEROCKET_x86_64"
  # ami_id = data.aws_ami.eks_default_bottlerocket.image_id
  min_size     = local.env.cluster.agent.replicas.min
  desired_size = local.env.cluster.agent.replicas.desired
  max_size     = local.env.cluster.agent.replicas.max
  # Must set to false when using Bottlerocket OS AMI for EKS nodes.
  enable_bootstrap_user_data = false
 
  # When using bottlerocket, the supplied user data (TOML format) is merged in with the values supplied by EKS. Therefore, pre_bootstrap_user_data and post_bootstrap_user_data are not valid since the bottlerocket OS handles when various settings are applied.

   bootstrap_extra_args = <<-EOT
      [settings.host-containers.admin]
      enabled = true
      [settings.host-containers.control]
      enabled = true
      [settings.kernel]
      lockdown = "integrity"
      [settings.kubernetes.node-labels]
      "bottlerocket.aws/updater-interface-version" = "2.0.0"
      [settings.kubernetes]
      cluster-name = "${module.eks.cluster_name}"
      api-server = "${module.eks.cluster_endpoint}"
      cluster-certificate = "${module.eks.cluster_certificate_authority_data}"
    EOT

  # Set this to true if you want the cluster to roll to new nodes when AWS releases updated EKS node images
  force_update_version = true

  labels = {
    "bottlerocket.aws/updater-interface-version" = "2.0.0"
    GithubRepo = "terraform-aws-eks"
    GithubOrg  = "terraform-aws-modules"
  }

  update_config = {
    max_unavailable_percentage = 33 # or set `max_unavailable`
  }

  description = "EKS managed node group example launch template"

  ebs_optimized           = true
  disable_api_termination = false
  enable_monitoring       = true

  # This is for the AWS EKS Optimized AMI image
  # block_device_mappings = {
  #   xvda = {
  #     device_name = "/dev/xvda"
  #     ebs = {
  #       volume_size           = 500
  #       volume_type           = "gp3"
  #       iops                  = 3000
  #       throughput            = 150
  #       encrypted             = true
  #       kms_key_id            = module.ebs_kms_key.key_arn
  #       delete_on_termination = true
  #     }
  #   }
  # }

  # This is for Bottlerocket CPU AMI
  block_device_mappings = {
    xvda = {
      device_name = "/dev/xvda"
      ebs = {
        volume_size           = 2
        volume_type           = "gp3"
        iops                  = 3000
        throughput            = 150
        encrypted             = true
        kms_key_id            = module.ebs_kms_key.key_arn
        delete_on_termination = true
      }
    }
    xvdb = {
      device_name = "/dev/xvdb"
      ebs = {
        volume_size           = 20
        volume_type           = "gp3"
        iops                  = 3000
        throughput            = 150
        encrypted             = true
        kms_key_id            = module.ebs_kms_key.key_arn
        delete_on_termination = true
      }
    }
  }

  metadata_options = {
    http_endpoint               = "enabled"
    http_tokens                 = "required"
    http_put_response_hop_limit = 2
    instance_metadata_tags      = "disabled"
  }

  create_iam_role          = true
  iam_role_name            = "bigbang-eks-managed-node-group"
  iam_role_use_name_prefix = false
  iam_role_description     = "EKS managed node group for bigbang role"
  iam_role_tags = {
    Purpose = "Protector of the kubelet"
  }
  iam_role_additional_policies = {
    AmazonEC2ContainerRegistryReadOnly = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
    additional                         = aws_iam_policy.node_additional.arn
    AmazonEc2FullAccess                = "arn:aws:iam::aws:policy/AmazonEC2FullAccess"
    CloudWatchLogsFullAccess           = "arn:aws:iam::aws:policy/CloudWatchLogsFullAccess"
    SecretsManagerReadWrite            = "arn:aws:iam::aws:policy/SecretsManagerReadWrite"
    AmazonSSMManagedInstanceCore       = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  }
},
bigbang_gpu = {
  name            = "${local.env.name}-agent-gpu-${local.random_name_suffix}"
  use_name_prefix = true
  subnet_ids      = local.env.vpc.private_subnet_ids
  # subnet_ids = local.env.vpc.public_subnet_ids

  instance_types = [local.env.cluster.agent.gpu_type]
  # Change the below to deploy either Amazon EKS optimized AMI or Bottlerocket AMI - reference the data.tf file for values
  # ami_id = data.aws_ami.eks_default_gpu.image_id
  ami_type     = "BOTTLEROCKET_x86_64_NVIDIA"
  # ami_id = data.aws_ami.eks_default_bottlerocket_gpu.image_id
  min_size     = local.env.cluster.agent.gpu_replicas.min
  desired_size = local.env.cluster.agent.gpu_replicas.desired
  max_size     = local.env.cluster.agent.gpu_replicas.max

  # Must set to false when using Bottlerocket OS AMI for EKS nodes.
  enable_bootstrap_user_data = false

  # When using bottlerocket, the supplied user data (TOML format) is merged in with the values supplied by EKS. Therefore, pre_bootstrap_user_data and post_bootstrap_user_data are not valid since the bottlerocket OS handles when various settings are applied.

  bootstrap_extra_args = <<-EOT
      [settings.host-containers.admin]
      enabled = true
      [settings.host-containers.control]
      enabled = true
      [settings.kernel]
      lockdown = "integrity"
      [settings.kubernetes.node-labels]
      "bottlerocket.aws/updater-interface-version" = "2.0.0"
      [settings.kubernetes]
      cluster-name = "${module.eks.cluster_name}"
      api-server = "${module.eks.cluster_endpoint}"
      cluster-certificate = "${module.eks.cluster_certificate_authority_data}"
    EOT

  # Set this to true if you want the cluster to roll to new nodes when AWS releases updated EKS node images
  force_update_version = true

  labels = {
    "bottlerocket.aws/updater-interface-version" = "2.0.0"
    GithubRepo = "terraform-aws-eks"
    GithubOrg  = "terraform-aws-modules"
    nodePool   = "gpu"
  }

  taints = [
    {
      key    = "dedicated"
      value  = "gpuGroup"
      effect = "NO_SCHEDULE"
    }
  ]

  #update_config = {
  #  max_unavailable_percentage = 33 # or set `max_unavailable`
  #}

  description = "EKS managed node group example launch template"

  ebs_optimized           = true
  disable_api_termination = false
  enable_monitoring       = true

  # This is for the AWS EKS Optimized AMI image
  # block_device_mappings = {
  #   xvda = {
  #     device_name = "/dev/xvda"
  #     ebs = {
  #       volume_size           = 200
  #       volume_type           = "gp3"
  #       iops                  = 3000
  #       throughput            = 150
  #       encrypted             = true
  #       kms_key_id            = module.ebs_kms_key.key_arn
  #       delete_on_termination = true
  #     }
  #   }
  # }

  # This is for Bottlerocket GPU AMI
  block_device_mappings = {
    xvda = {
      device_name = "/dev/xvda"
      ebs = {
        volume_size           = 4
        volume_type           = "gp3"
        iops                  = 3000
        throughput            = 150
        encrypted             = true
        kms_key_id            = module.ebs_kms_key.key_arn
        delete_on_termination = true
      }
    }
    xvdb = {
      device_name = "/dev/xvdb"
      ebs = {
        volume_size           = 18
        volume_type           = "gp3"
        iops                  = 3000
        throughput            = 150
        encrypted             = true
        kms_key_id            = module.ebs_kms_key.key_arn
        delete_on_termination = true
      }
    }
  }

  metadata_options = {
    http_endpoint               = "enabled"
    http_tokens                 = "required"
    http_put_response_hop_limit = 2
    instance_metadata_tags      = "disabled"
  }

  create_iam_role = false
  iam_role_arn    = "arn:aws:iam::971870020263:role/bigbang-eks-managed-node-group"
  tags            = local.env.tags
}

}

tags = local.env.tags

depends_on = [module.vpc, module.elb, module.elb_passthrough]

}

Answer 1 · 2024-12-24T14:16:46.000Z

This is the error received when upgrading from v19 to v20:
Error: waiting for EKS Node Group (bigbang-development-28i:bigbang-development-agent-gpu-28i-20241224115215482100000029) version update (fc5a4074-f281-317e-a851-472bceffb830): unexpected state 'Failed', wanted target 'Successful'. last error: : NodeCreationFailure: Couldn't proceed with upgrade process as new nodes are not joining node group bigbang-development-agent-gpu-28i-20241224115215482100000029
│
│ with module.eks.module.eks_managed_node_group["bigbang_gpu"].aws_eks_node_group.this[0],
│ on .terraform\modules\eks\modules\eks-managed-node-group\main.tf line 392, in resource "aws_eks_node_group" "this":
│ 392: resource "aws_eks_node_group" "this" {
│

Here is the user data that is passed to the Bottlerocket CPU instance in version 19:
[settings.kubernetes]
"cluster-name" = "bigbang-development-28i"
"api-server" = "https://D17B803059777D9F62BD52A5EE8416E0.gr7.us-east-1.eks.amazonaws.com"
"cluster-certificate" = "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURCVENDQWUyZ0F3SUJBZ0lJYXVTQWg2elFzb3d3RFFZSktvWklodmNOQVFFTEJRQXdGVEVUTUJFR0ExVUUKQXhNS2EzVmlaWEp1WlhSbGN6QWVGdzB5TkRFeU1qUXhNVFF3TXpSYUZ3MHpOREV5TWpJeE1UUTFNelJhTUJVeApFekFSQmdOVkJBTVRDbXQxWW1WeWJtVjBaWE13Z2dFaU1BMEdDU3FHU0liM0RRRUJBUVVBQTRJQkR3QXdnZ0VLCkFvSUJBUURNSGdLbkphOWd6Snlqak1XUzRZZ3NRRmh2Z2dCQVRrYjJTNnY2K0dpWGxpczZ4UDhvUE5VN2hDSDYKT2doUnlrUjI0VGVCQkQvQnRBODJ0NXYxMzBGdTBBU1RPcmhNT0M2NDh0TzNodFhJdytaa1pmeVFIQ2JlYWx3awoxWnZTbHAxVlRLeG5YWmZ5QVpEK2lPdEhjci9HbEEzc1hkcnBVSVdrSkN2NGFQZVZPQnZSVm8wbmw5dmx3RzlsClZhc1ZaVFhIcWlKMjVEWitoV2U2emNydm9KdFFCdVVjZTB1OUd3MXBReUc5eEFUNjdZSElhRTBNZkdLelMwdGoKMklPb2s3eHNpbUZMRkxjKzUwS09UY25pRkJpOWMyL1IzMGlJanJhd3RyL2F5S1NsNnA5MWh1ek5XK2E4YVZYRgpqbGJlR3dXSVhZYjVUTy9KczBLdXQ1V1ZxM2J4QWdNQkFBR2pXVEJYTUE0R0ExVWREd0VCL3dRRUF3SUNwREFQCkJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJRQS82aGJUaVpXZ3VKcHBuU3BQWXltWFArVFl6QVYKQmdOVkhSRUVEakFNZ2dwcmRXSmxjbTVsZEdWek1BMEdDU3FHU0liM0RRRUJDd1VBQTRJQkFRQ3psajNnM0RpaAphbVh0RlcwejhORFdxN2dnN1ltU2hoRW5INUpaNThuSVZ0Wk1DaHBLNFdBR3krQzZwSVVIYlQ4YUFydGYxSmk0CkxiSTFiZys4WitLMmpmYTYvWmtBeWFyS2gvK25PV0tTKzVvN3lXUnBaSVpBSHJYb1lPUk9aOTZGaHZmaTJ1dFcKZGdaNHBNT1NZcUdIaStmNXhxMlBiYnlPUkU5R3F5R0p0UmI1aEw1aGNyNnRST1dVaHl1UE1BS01TMjNPbzVCMAo0dnFTcytseGxORkRJa2Z5T3IxaitLRWpCYUhIVWNZMjQ2cTExSnIxR3oyQTF6ZXAvNmgrTVBlYzMydWNFeVZiCis3TTlMbFlWMVpRNDFIWUhKdi9tNit4N1hOTzRDMmoxSGlrNXo1dm9tU2xMUUU2OG1HcGZ3Tm8zVGh6Ti9ubXkKY0pZNUllQStRUmxlCi0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K"
"cluster-dns-ip" = "172.20.0.10"
"max-pods" = 58
[settings.kubernetes.node-labels]
"eks.amazonaws.com/sourceLaunchTemplateVersion" = "1"
"bottlerocket.aws/updater-interface-version" = "2.0.0"
"GithubRepo" = "terraform-aws-eks"
"eks.amazonaws.com/nodegroup-image" = "ami-08e8202f7551d19fb"
"eks.amazonaws.com/capacityType" = "ON_DEMAND"
"eks.amazonaws.com/nodegroup" = "bigbang-development-agent-28i-20241224115215482100000027"
"eks.amazonaws.com/sourceLaunchTemplateId" = "lt-0a5a16bb240eab278"
"GithubOrg" = "terraform-aws-modules"

Here is the user data that is passed to the GPU instance in version 19:

[settings.kubernetes]
"cluster-name" = "bigbang-development-28i"
"api-server" = "https://D17B803059777D9F62BD52A5EE8416E0.gr7.us-east-1.eks.amazonaws.com"
"cluster-certificate" = "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURCVENDQWUyZ0F3SUJBZ0lJYXVTQWg2elFzb3d3RFFZSktvWklodmNOQVFFTEJRQXdGVEVUTUJFR0ExVUUKQXhNS2EzVmlaWEp1WlhSbGN6QWVGdzB5TkRFeU1qUXhNVFF3TXpSYUZ3MHpOREV5TWpJeE1UUTFNelJhTUJVeApFekFSQmdOVkJBTVRDbXQxWW1WeWJtVjBaWE13Z2dFaU1BMEdDU3FHU0liM0RRRUJBUVVBQTRJQkR3QXdnZ0VLCkFvSUJBUURNSGdLbkphOWd6Snlqak1XUzRZZ3NRRmh2Z2dCQVRrYjJTNnY2K0dpWGxpczZ4UDhvUE5VN2hDSDYKT2doUnlrUjI0VGVCQkQvQnRBODJ0NXYxMzBGdTBBU1RPcmhNT0M2NDh0TzNodFhJdytaa1pmeVFIQ2JlYWx3awoxWnZTbHAxVlRLeG5YWmZ5QVpEK2lPdEhjci9HbEEzc1hkcnBVSVdrSkN2NGFQZVZPQnZSVm8wbmw5dmx3RzlsClZhc1ZaVFhIcWlKMjVEWitoV2U2emNydm9KdFFCdVVjZTB1OUd3MXBReUc5eEFUNjdZSElhRTBNZkdLelMwdGoKMklPb2s3eHNpbUZMRkxjKzUwS09UY25pRkJpOWMyL1IzMGlJanJhd3RyL2F5S1NsNnA5MWh1ek5XK2E4YVZYRgpqbGJlR3dXSVhZYjVUTy9KczBLdXQ1V1ZxM2J4QWdNQkFBR2pXVEJYTUE0R0ExVWREd0VCL3dRRUF3SUNwREFQCkJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJRQS82aGJUaVpXZ3VKcHBuU3BQWXltWFArVFl6QVYKQmdOVkhSRUVEakFNZ2dwcmRXSmxjbTVsZEdWek1BMEdDU3FHU0liM0RRRUJDd1VBQTRJQkFRQ3psajNnM0RpaAphbVh0RlcwejhORFdxN2dnN1ltU2hoRW5INUpaNThuSVZ0Wk1DaHBLNFdBR3krQzZwSVVIYlQ4YUFydGYxSmk0CkxiSTFiZys4WitLMmpmYTYvWmtBeWFyS2gvK25PV0tTKzVvN3lXUnBaSVpBSHJYb1lPUk9aOTZGaHZmaTJ1dFcKZGdaNHBNT1NZcUdIaStmNXhxMlBiYnlPUkU5R3F5R0p0UmI1aEw1aGNyNnRST1dVaHl1UE1BS01TMjNPbzVCMAo0dnFTcytseGxORkRJa2Z5T3IxaitLRWpCYUhIVWNZMjQ2cTExSnIxR3oyQTF6ZXAvNmgrTVBlYzMydWNFeVZiCis3TTlMbFlWMVpRNDFIWUhKdi9tNit4N1hOTzRDMmoxSGlrNXo1dm9tU2xMUUU2OG1HcGZ3Tm8zVGh6Ti9ubXkKY0pZNUllQStRUmxlCi0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K"
"cluster-dns-ip" = "172.20.0.10"
"max-pods" = 58
[settings.kubernetes.node-labels]
"nodePool" = "gpu"
"eks.amazonaws.com/sourceLaunchTemplateVersion" = "1"
"bottlerocket.aws/updater-interface-version" = "2.0.0"
"GithubRepo" = "terraform-aws-eks"
"eks.amazonaws.com/nodegroup-image" = "ami-016501e3b19da26b2"
"eks.amazonaws.com/capacityType" = "ON_DEMAND"
"eks.amazonaws.com/nodegroup" = "bigbang-development-agent-gpu-28i-20241224115215482100000029"
"eks.amazonaws.com/sourceLaunchTemplateId" = "lt-0645b7457ca252e23"
"GithubOrg" = "terraform-aws-modules"
[settings.kubernetes.node-taints]
"dedicated" = "gpuGroup:NoSchedule"

After the upgrade to version 20, this is what the user data looks like for the CPU nodes:

settings.kubernetes.cluster-name = 'bigbang-development-28i'
settings.kubernetes.api-server = 'https://D17B803059777D9F62BD52A5EE8416E0.gr7.us-east-1.eks.amazonaws.com'
settings.kubernetes.cluster-certificate = 'LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURCVENDQWUyZ0F3SUJBZ0lJYXVTQWg2elFzb3d3RFFZSktvWklodmNOQVFFTEJRQXdGVEVUTUJFR0ExVUUKQXhNS2EzVmlaWEp1WlhSbGN6QWVGdzB5TkRFeU1qUXhNVFF3TXpSYUZ3MHpOREV5TWpJeE1UUTFNelJhTUJVeApFekFSQmdOVkJBTVRDbXQxWW1WeWJtVjBaWE13Z2dFaU1BMEdDU3FHU0liM0RRRUJBUVVBQTRJQkR3QXdnZ0VLCkFvSUJBUURNSGdLbkphOWd6Snlqak1XUzRZZ3NRRmh2Z2dCQVRrYjJTNnY2K0dpWGxpczZ4UDhvUE5VN2hDSDYKT2doUnlrUjI0VGVCQkQvQnRBODJ0NXYxMzBGdTBBU1RPcmhNT0M2NDh0TzNodFhJdytaa1pmeVFIQ2JlYWx3awoxWnZTbHAxVlRLeG5YWmZ5QVpEK2lPdEhjci9HbEEzc1hkcnBVSVdrSkN2NGFQZVZPQnZSVm8wbmw5dmx3RzlsClZhc1ZaVFhIcWlKMjVEWitoV2U2emNydm9KdFFCdVVjZTB1OUd3MXBReUc5eEFUNjdZSElhRTBNZkdLelMwdGoKMklPb2s3eHNpbUZMRkxjKzUwS09UY25pRkJpOWMyL1IzMGlJanJhd3RyL2F5S1NsNnA5MWh1ek5XK2E4YVZYRgpqbGJlR3dXSVhZYjVUTy9KczBLdXQ1V1ZxM2J4QWdNQkFBR2pXVEJYTUE0R0ExVWREd0VCL3dRRUF3SUNwREFQCkJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJRQS82aGJUaVpXZ3VKcHBuU3BQWXltWFArVFl6QVYKQmdOVkhSRUVEakFNZ2dwcmRXSmxjbTVsZEdWek1BMEdDU3FHU0liM0RRRUJDd1VBQTRJQkFRQ3psajNnM0RpaAphbVh0RlcwejhORFdxN2dnN1ltU2hoRW5INUpaNThuSVZ0Wk1DaHBLNFdBR3krQzZwSVVIYlQ4YUFydGYxSmk0CkxiSTFiZys4WitLMmpmYTYvWmtBeWFyS2gvK25PV0tTKzVvN3lXUnBaSVpBSHJYb1lPUk9aOTZGaHZmaTJ1dFcKZGdaNHBNT1NZcUdIaStmNXhxMlBiYnlPUkU5R3F5R0p0UmI1aEw1aGNyNnRST1dVaHl1UE1BS01TMjNPbzVCMAo0dnFTcytseGxORkRJa2Z5T3IxaitLRWpCYUhIVWNZMjQ2cTExSnIxR3oyQTF6ZXAvNmgrTVBlYzMydWNFeVZiCis3TTlMbFlWMVpRNDFIWUhKdi9tNit4N1hOTzRDMmoxSGlrNXo1dm9tU2xMUUU2OG1HcGZ3Tm8zVGh6Ti9ubXkKY0pZNUllQStRUmxlCi0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K'
settings.kubernetes.cluster-dns-ip = '172.20.0.10'
settings.kubernetes.max-pods = 58
settings.kubernetes.node-labels.'eks.amazonaws.com/sourceLaunchTemplateVersion' = '2'
settings.kubernetes.node-labels.'bottlerocket.aws/updater-interface-version' = '2.0.0'
settings.kubernetes.node-labels.GithubRepo = 'terraform-aws-eks'
settings.kubernetes.node-labels.'eks.amazonaws.com/nodegroup-image' = 'ami-08e8202f7551d19fb'
settings.kubernetes.node-labels.'eks.amazonaws.com/capacityType' = 'ON_DEMAND'
settings.kubernetes.node-labels.'eks.amazonaws.com/nodegroup' = 'bigbang-development-agent-28i-20241224115215482100000027'
settings.kubernetes.node-labels.'eks.amazonaws.com/sourceLaunchTemplateId' = 'lt-0a5a16bb240eab278'
settings.kubernetes.node-labels.GithubOrg = 'terraform-aws-modules'
settings.host-containers.admin.enabled = true
settings.host-containers.control.enabled = true
settings.kernel.lockdown = 'integrity'

And this is the user data for the GPU instances. I had to delete the managed node group from the EKS console and rerun Terraform to get them to build:

settings.kubernetes.cluster-name = 'bigbang-development-28i'
settings.kubernetes.api-server = 'https://D17B803059777D9F62BD52A5EE8416E0.gr7.us-east-1.eks.amazonaws.com'
settings.kubernetes.cluster-certificate = 'LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURCVENDQWUyZ0F3SUJBZ0lJYXVTQWg2elFzb3d3RFFZSktvWklodmNOQVFFTEJRQXdGVEVUTUJFR0ExVUUKQXhNS2EzVmlaWEp1WlhSbGN6QWVGdzB5TkRFeU1qUXhNVFF3TXpSYUZ3MHpOREV5TWpJeE1UUTFNelJhTUJVeApFekFSQmdOVkJBTVRDbXQxWW1WeWJtVjBaWE13Z2dFaU1BMEdDU3FHU0liM0RRRUJBUVVBQTRJQkR3QXdnZ0VLCkFvSUJBUURNSGdLbkphOWd6Snlqak1XUzRZZ3NRRmh2Z2dCQVRrYjJTNnY2K0dpWGxpczZ4UDhvUE5VN2hDSDYKT2doUnlrUjI0VGVCQkQvQnRBODJ0NXYxMzBGdTBBU1RPcmhNT0M2NDh0TzNodFhJdytaa1pmeVFIQ2JlYWx3awoxWnZTbHAxVlRLeG5YWmZ5QVpEK2lPdEhjci9HbEEzc1hkcnBVSVdrSkN2NGFQZVZPQnZSVm8wbmw5dmx3RzlsClZhc1ZaVFhIcWlKMjVEWitoV2U2emNydm9KdFFCdVVjZTB1OUd3MXBReUc5eEFUNjdZSElhRTBNZkdLelMwdGoKMklPb2s3eHNpbUZMRkxjKzUwS09UY25pRkJpOWMyL1IzMGlJanJhd3RyL2F5S1NsNnA5MWh1ek5XK2E4YVZYRgpqbGJlR3dXSVhZYjVUTy9KczBLdXQ1V1ZxM2J4QWdNQkFBR2pXVEJYTUE0R0ExVWREd0VCL3dRRUF3SUNwREFQCkJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJRQS82aGJUaVpXZ3VKcHBuU3BQWXltWFArVFl6QVYKQmdOVkhSRUVEakFNZ2dwcmRXSmxjbTVsZEdWek1BMEdDU3FHU0liM0RRRUJDd1VBQTRJQkFRQ3psajNnM0RpaAphbVh0RlcwejhORFdxN2dnN1ltU2hoRW5INUpaNThuSVZ0Wk1DaHBLNFdBR3krQzZwSVVIYlQ4YUFydGYxSmk0CkxiSTFiZys4WitLMmpmYTYvWmtBeWFyS2gvK25PV0tTKzVvN3lXUnBaSVpBSHJYb1lPUk9aOTZGaHZmaTJ1dFcKZGdaNHBNT1NZcUdIaStmNXhxMlBiYnlPUkU5R3F5R0p0UmI1aEw1aGNyNnRST1dVaHl1UE1BS01TMjNPbzVCMAo0dnFTcytseGxORkRJa2Z5T3IxaitLRWpCYUhIVWNZMjQ2cTExSnIxR3oyQTF6ZXAvNmgrTVBlYzMydWNFeVZiCis3TTlMbFlWMVpRNDFIWUhKdi9tNit4N1hOTzRDMmoxSGlrNXo1dm9tU2xMUUU2OG1HcGZ3Tm8zVGh6Ti9ubXkKY0pZNUllQStRUmxlCi0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K'
settings.kubernetes.cluster-dns-ip = '172.20.0.10'
settings.kubernetes.max-pods = 58
settings.kubernetes.node-labels.nodePool = 'gpu'
settings.kubernetes.node-labels.'eks.amazonaws.com/sourceLaunchTemplateVersion' = '2'
settings.kubernetes.node-labels.'bottlerocket.aws/updater-interface-version' = '2.0.0'
settings.kubernetes.node-labels.GithubRepo = 'terraform-aws-eks'
settings.kubernetes.node-labels.'eks.amazonaws.com/nodegroup-image' = 'ami-016501e3b19da26b2'
settings.kubernetes.node-labels.'eks.amazonaws.com/capacityType' = 'ON_DEMAND'
settings.kubernetes.node-labels.'eks.amazonaws.com/nodegroup' = 'bigbang-development-agent-gpu-28i-20241224141205476300000001'
settings.kubernetes.node-labels.'eks.amazonaws.com/sourceLaunchTemplateId' = 'lt-0645b7457ca252e23'
settings.kubernetes.node-labels.GithubOrg = 'terraform-aws-modules'
settings.kubernetes.node-taints.dedicated = 'gpuGroup:NoSchedule'
settings.host-containers.admin.enabled = true
settings.host-containers.control.enabled = true
settings.kernel.lockdown = 'integrity'

The AMIs are the same across both EKS module versions; only the GPU nodes on version 20 fail to join the cluster.
CPU AMI name - bottlerocket-aws-k8s-1.31-x86_64-v1.29.0-c55d099c
GPU AMI name - bottlerocket-aws-k8s-1.31-nvidia-x86_64-v1.29.0-c55d099c
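
Comparing the user data dumps above, the only settings that appear in the version 20 output but not in the version 19 output are the admin host container, the control host container, and the kernel lockdown setting. One way to narrow this down is to redeploy the GPU group with only the kernel lockdown section removed and everything else left as it is. This is a diagnostic sketch, not a confirmed fix: the idea that the lockdown setting is involved is an untested assumption, and the fragment below is a trimmed copy of the bigbang_gpu entry shown earlier (block device mappings, labels, and metadata options are omitted for brevity and would stay unchanged).

```hcl
# Diagnostic sketch only: goes inside eks_managed_node_groups = { ... } in place
# of the existing bigbang_gpu entry. All values are taken from the original
# configuration; the only intended change is that [settings.kernel] is removed.
bigbang_gpu = {
  name            = "${local.env.name}-agent-gpu-${local.random_name_suffix}"
  use_name_prefix = true
  subnet_ids      = local.env.vpc.private_subnet_ids

  instance_types = [local.env.cluster.agent.gpu_type]
  ami_type       = "BOTTLEROCKET_x86_64_NVIDIA"
  min_size       = local.env.cluster.agent.gpu_replicas.min
  desired_size   = local.env.cluster.agent.gpu_replicas.desired
  max_size       = local.env.cluster.agent.gpu_replicas.max

  enable_bootstrap_user_data = false

  # Same TOML as before, minus the [settings.kernel] lockdown section.
  # Assumption under test: this setting is what differs for the NVIDIA variant.
  bootstrap_extra_args = <<-EOT
    [settings.host-containers.admin]
    enabled = true
    [settings.host-containers.control]
    enabled = true
    [settings.kubernetes.node-labels]
    "bottlerocket.aws/updater-interface-version" = "2.0.0"
    [settings.kubernetes]
    cluster-name = "${module.eks.cluster_name}"
    api-server = "${module.eks.cluster_endpoint}"
    cluster-certificate = "${module.eks.cluster_certificate_authority_data}"
  EOT

  taints = [
    {
      key    = "dedicated"
      value  = "gpuGroup"
      effect = "NO_SCHEDULE"
    }
  ]

  create_iam_role = false
  iam_role_arn    = "arn:aws:iam::971870020263:role/bigbang-eks-managed-node-group"
  tags            = local.env.tags
}
```

If the GPU nodes join with this change, the lockdown setting is the likely difference. If they still fail, the two host-container sections could be removed the same way, one at a time, until the rendered user data matches the version 19 output.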