terraform-aws-modules/terraform-aws-eks

AutoScalingGroup timeout: Status Reason: 'kubernetes.io/cluster/clustername' is not a valid tag key.

brenwhyte opened this issue · 15 comments

Description

Testing the Complete and IRSA examples, I get an error that the Auto Scaling group can't create instances:

Versions

❯ terraform version
Terraform v1.1.3
on linux_amd64

  • provider registry.terraform.io/hashicorp/aws v3.72.0
  • provider registry.terraform.io/hashicorp/cloudinit v2.2.0
  • provider registry.terraform.io/hashicorp/helm v2.4.1
  • provider registry.terraform.io/hashicorp/null v3.1.0
  • provider registry.terraform.io/hashicorp/tls v3.1.0

Reproduction

Clone repo and test the examples.

Code Snippet to Reproduce

Expected behavior

Auto Scaling group creates instances

Actual behavior

Terminal Output Screenshot(s)

module.eks.module.self_managed_node_group["spot"].aws_autoscaling_group.this[0]: Still destroying... [id=spot-20220117114248850100000023, 9m50s elapsed]
module.eks.module.self_managed_node_group["refresh"].aws_autoscaling_group.this[0]: Still creating... [9m50s elapsed]
module.eks.module.self_managed_node_group["spot"].aws_autoscaling_group.this[0]: Still destroying... [id=spot-20220117114248850100000023, 10m0s elapsed]
module.eks.module.self_managed_node_group["refresh"].aws_autoscaling_group.this[0]: Still creating... [10m0s elapsed]
╷
│ Error: "refresh-20220117122547139700000001": Waiting up to 10m0s: Need at least 1 healthy instances in ASG, have 0. Most recent activity: {
│   ActivityId: "a8b5f9a0-2397-c9ec-a78e-25f731576957",
│   AutoScalingGroupARN: "arn:aws:autoscaling:eu-west-1:*snip*:autoScalingGroup:a6153771-becc-4c1c-89da-44d7e9044423:autoScalingGroupName/refresh-20220117122547139700000001",
│   AutoScalingGroupName: "refresh-20220117122547139700000001",
│   Cause: "At 2022-01-17T12:35:19Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 1.",
│   Description: "Launching a new EC2 instance.  Status Reason: 'kubernetes.io/cluster/exirsatest' is not a valid tag key. Tag keys must match pattern ([0-9a-zA-Z\\\\-_+=,.@:]{1,255}), and must not be a reserved name ('.', '..', '_index'). Launching EC2 instance failed.",
│   Details: "{\"Subnet ID\":\"subnet-006ccbc3820532b53\",\"Availability Zone\":\"eu-west-1c\"}",
│   EndTime: 2022-01-17 12:35:20 +0000 UTC,
│   Progress: 100,
│   StartTime: 2022-01-17 12:35:20.69 +0000 UTC,
│   StatusCode: "Failed",
│   StatusMessage: "'kubernetes.io/cluster/exirsatest' is not a valid tag key. Tag keys must match pattern ([0-9a-zA-Z\\\\-_+=,.@:]{1,255}), and must not be a reserved name ('.', '..', '_index'). Launching EC2 instance failed."
│ }
│ 
│   with module.eks.module.self_managed_node_group["refresh"].aws_autoscaling_group.this[0],
│   on ../../modules/self-managed-node-group/main.tf line 260, in resource "aws_autoscaling_group" "this":
│  260: resource "aws_autoscaling_group" "this" {
│ 
╵

Additional context

You can see I tried renaming the folder to exirsatest and that didn't help.

Removing the tag keys below unblocked the ASG, but of course that is not a fix (see the sketch after the list).

kubernetes.io/cluster/exirsatest
k8s.io/cluster/ex-irsatest
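
For illustration, a hypothetical sketch of what that workaround looks like in the example's node group configuration; only the key names come from the error above, the tag values and the surrounding block are assumptions:

# Hypothetical sketch of the workaround: dropping the slash-containing
# cluster tag keys from the node group tags. This only unblocks instance
# launch; Kubernetes integrations still expect these tags, so it is not a fix.
self_managed_node_groups = {
  refresh = {
    tags = {
      Environment = "test" # fine: no "/" in the key
      # "kubernetes.io/cluster/exirsatest" = "owned"  # removed to unblock the ASG
      # "k8s.io/cluster/ex-irsatest"       = "owned"  # removed to unblock the ASG
    }
  }
}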

Please paste the config of the module you are using.

❯ git clone https://github.com/terraform-aws-modules/terraform-aws-eks.git
Cloning into 'terraform-aws-eks'...
remote: Enumerating objects: 4176, done.
remote: Counting objects: 100% (843/843), done.
remote: Compressing objects: 100% (457/457), done.
remote: Total 4176 (delta 508), reused 620 (delta 381), pack-reused 3333
Receiving objects: 100% (4176/4176), 1.42 MiB | 535.00 KiB/s, done.
Resolving deltas: 100% (2716/2716), done.
❯ cd terraform-aws-eks/examples/irsa_autoscale_refresh
❯ terraform init
Initializing modules...
Downloading registry.terraform.io/terraform-aws-modules/iam/aws 4.9.0 for aws_node_termination_handler_role...
- aws_node_termination_handler_role in .terraform/modules/aws_node_termination_handler_role/modules/iam-assumable-role-with-oidc
Downloading registry.terraform.io/terraform-aws-modules/sqs/aws 3.2.1 for aws_node_termination_handler_sqs...
- aws_node_termination_handler_sqs in .terraform/modules/aws_node_termination_handler_sqs
- eks in ../..
- eks.eks_managed_node_group in ../../modules/eks-managed-node-group
- eks.eks_managed_node_group.user_data in ../../modules/_user_data
- eks.fargate_profile in ../../modules/fargate-profile
- eks.self_managed_node_group in ../../modules/self-managed-node-group
- eks.self_managed_node_group.user_data in ../../modules/_user_data
Downloading registry.terraform.io/terraform-aws-modules/iam/aws 4.9.0 for iam_assumable_role_cluster_autoscaler...
- iam_assumable_role_cluster_autoscaler in .terraform/modules/iam_assumable_role_cluster_autoscaler/modules/iam-assumable-role-with-oidc
Downloading registry.terraform.io/terraform-aws-modules/vpc/aws 3.11.3 for vpc...
- vpc in .terraform/modules/vpc

Initializing the backend...

Initializing provider plugins...
- Finding hashicorp/helm versions matching ">= 2.0.0"...
- Finding hashicorp/aws versions matching ">= 2.23.0, >= 3.63.0, >= 3.72.0"...
- Finding hashicorp/null versions matching ">= 3.0.0"...
- Finding hashicorp/tls versions matching ">= 2.2.0"...
- Finding hashicorp/cloudinit versions matching ">= 2.0.0"...
- Installing hashicorp/helm v2.4.1...
- Installed hashicorp/helm v2.4.1 (signed by HashiCorp)
- Installing hashicorp/aws v3.72.0...
- Installed hashicorp/aws v3.72.0 (signed by HashiCorp)
- Installing hashicorp/null v3.1.0...
- Installed hashicorp/null v3.1.0 (signed by HashiCorp)
- Installing hashicorp/tls v3.1.0...
- Installed hashicorp/tls v3.1.0 (signed by HashiCorp)
- Installing hashicorp/cloudinit v2.2.0...
- Installed hashicorp/cloudinit v2.2.0 (signed by HashiCorp)

Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
❯ terraform apply --auto-approve

and it should time out and fail when creating the ASGs.

All three Auto Scaling groups from the example have the same issue:

[screenshot: the three Auto Scaling groups all failing with the tag-key error]

It looks fine per awsdocs/amazon-eks-user-guide#38 (comment), but the regex doesn't allow forward slashes (see the check below).
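
To make the mismatch concrete, here is a quick check you can run in terraform console. The character class is copied from the error message (with anchors added, since the error's pattern is implicitly a full match); everything else is illustrative:

locals {
  # Tag-key pattern from the error message; note there is no "/" in the class.
  imds_tag_key_pattern = "^[0-9a-zA-Z\\-_+=,.@:]{1,255}$"

  ok_key  = can(regex(local.imds_tag_key_pattern, "Environment"))                      # true
  bad_key = can(regex(local.imds_tag_key_pattern, "kubernetes.io/cluster/exirsatest")) # false: "/" rejected
}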

@brenwhyte are you able to file a ticket with AWS to get their input?

I can; I'll get on that in two secs. I noticed that the tags are the same on an earlier cluster, so this did work previously.

Let's see what AWS Support says.

I get the same errors with the 'self-managed-node-group' example:

│   AutoScalingGroupName: "worker-group-2-20220117194210164800000003",
│   Cause: "At 2022-01-17T19:49:11Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 1.",
│   Description: "Launching a new EC2 instance. Status Reason: 'k8s.io/cluster/self-test-eks-lU3GjdcB' is not a valid tag key. Tag keys must match pattern ([0-9a-zA-Z\\\\-_+=,.@:]{1,255}), and must not be a reserved name ('.', '..', '_index'). Launching EC2 instance failed.",
│   Details: "{\"Subnet ID\":\"subnet-0f4c15ddb73b04b7a\",\"Availability Zone\":\"us-east-1a\"}",
│   EndTime: 2022-01-17 19:49:12 +0000 UTC,
│   Progress: 100,
│   StartTime: 2022-01-17 19:49:12.43 +0000 UTC,
│   StatusCode: "Failed",
│   StatusMessage: "'k8s.io/cluster/self-test-eks-lU3GjdcB' is not a valid tag key. Tag keys must match pattern ([0-9a-zA-Z\\\\-_+=,.@:]{1,255}), and must not be a reserved name ('.', '..', '_index'). Launching EC2 instance failed."
│ }
│
│   with module.eks.module.self_managed_node_group["1"].aws_autoscaling_group.this[0],
│   on .terraform/modules/eks/modules/self-managed-node-group/main.tf line 260, in resource "aws_autoscaling_group" "this":
│  260: resource "aws_autoscaling_group" "this" {

The EKS managed node groups can no longer be created either.
Could this be a similar problem?

Error: error creating EKS Node Group (<eks_cluster_name>:<eks_node_group_name>): InvalidRequestException: 'k8s.io/cluster-autoscaler/enabled' is not a valid tag key. Tag keys must match pattern ([0-9a-zA-Z\-_+=,.@:]{1,255}), and must not be a reserved name ('.', '..', '_index')
{
  RespMetadata: {
    StatusCode: 400,
    RequestID: ""
  },
  Message: "'k8s.io/cluster-autoscaler/enabled' is not a valid tag key. Tag keys must match pattern ([0-9a-zA-Z\\-_+=,.@:]{1,255}), and must not be a reserved name ('.', '..', '_index')"
}

This seems to be a change on the AWS side, since these examples were previously working without issue. Is anyone able to file a ticket with AWS Support to get their feedback?

By the way, it does not occur in 18.1.0, but it does in 18.2.0.

Yes, the problem seems to be related to instance_metadata_tags = "enabled". Once it is disabled, the problem disappears.
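
For anyone pinned to 18.2.0, a minimal sketch of that workaround, assuming the node group definition passes metadata_options through to the launch template (the attribute names follow the aws_launch_template metadata_options block; the cluster name and the rest of the config are placeholders):

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "18.2.0"

  cluster_name    = "exirsatest" # placeholder
  cluster_version = "1.21"       # placeholder

  # ... vpc_id, subnet_ids, etc. as in the examples ...

  self_managed_node_groups = {
    refresh = {
      # Keep IMDS enabled but stop exposing tags through instance metadata,
      # since IMDS tag keys cannot contain "/".
      metadata_options = {
        http_endpoint               = "enabled"
        http_tokens                 = "required"
        http_put_response_hop_limit = 2
        instance_metadata_tags      = "disabled"
      }
    }
  }
}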

Hmm, interesting. Thanks for identifying that

I'm seeing the same issue. Setting instance_metadata_tags = "disabled" corrected it for me as well.

This issue has been resolved in version 18.2.1 🎉

OK, we've changed the default behavior to disabled. It's disappointing that AWS has different tag-key requirements for instance metadata tags, but we will leave it up to users to manage for now.
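
If you do want instance metadata tags after 18.2.1, a hedged sketch of opting back in per node group; this is only safe when none of the tag keys propagated to the launch template contain "/":

self_managed_node_groups = {
  default = {
    metadata_options = {
      # Opt back in to IMDS tags; make sure no tag key contains "/".
      instance_metadata_tags = "enabled"
    }
  }
}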

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.