awslabs/amazon-eks-ami

InstanceIdNodeName causes bootstrapping to fail

Closed this issue · 10 comments

What happened:

Enabling the feature gate InstanceIdNodeName according to https://awslabs.github.io/amazon-eks-ami/nodeadm/doc/examples/#using-instance-id-as-node-name-experimental causes bootstrapping to fail.

What you expected to happen:

Either for bootstrapping to succeed, or to get an error message in the EC2 system logs.

How to reproduce it (as minimally and precisely as possible):

Add the following to the NodeConfig:

apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  kubelet:
    config:
      featureGates:
        InstanceIdNodeName: true

Anything else we need to know?:

Tested on both CloudFormation AutoScalingGroup nodes and Karpenter nodes; same behaviour.

The Node IAM role is in place as described.

Environment:

  • AWS Region: us-east-2
  • Instance Type(s): t3, r6
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.28
  • AMI Version: amazon-eks-node-al2023-x86_64-standard-1.28-v20240514

The problem was mine - the feature gate should not be placed under the kubelet config but rather under the NodeConfig spec.
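For reference, the spec-level placement looks like this (a sketch based on the nodeadm docs linked above):

```yaml
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  featureGates:
    InstanceIdNodeName: true
```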
Actually, the successes and failures seem more or less random.

@universam1 please let us know if you run into any other issues with this feature gate, we want to make this the default in AL2025 👍

@cartermckinnon Thanks for checking back - I take that back: it was not my config error. This issue happens intermittently, with roughly a 30% success rate.
I can replicate that with InstanceIdNodeName: true enabled, bootstrapping fails for ~70% of all tested instances; I've tried around 30 instances so far. This holds true for both Karpenter- and ASG-created nodes.
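For a rough success-rate tally, node names can be checked against the instance-ID pattern; the sample input below is illustrative, standing in for real `kubectl get nodes -o name` output:

```shell
# Count registered nodes whose name looks like an EC2 instance ID
# (the InstanceIdNodeName naming scheme). Sample lines are illustrative.
printf 'node/i-0056120976d3d3f8b\nnode/ip-172-31-6-6.us-east-2.compute.internal\n' \
  | grep -Ec '^node/i-[0-9a-f]+$'
```

On a live cluster you would pipe `kubectl get nodes -o name` into the same grep instead of the printf.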

I tried to get some logs via journalctl -u nodeadm-config -u nodeadmrun:

[    9.620333] cloud-init[1684]: 2024-05-29 08:27:46,748 - __init__.py[WARNING]: Unhandled unknown content-type (application/node.eks.aws) userdata: 'b'---'...'
[   10.700846] cloud-init[1684]: Generating public/private ed25519 key pair.
[   10.703543] cloud-init[1684]: Your identification has been saved in /etc/ssh/ssh_host_ed25519_key
[   10.706692] cloud-init[1684]: Your public key has been saved in /etc/ssh/ssh_host_ed25519_key.pub
[   10.709952] cloud-init[1684]: The key fingerprint is:
[   10.712020] cloud-init[1684]: SHA256:3OYLSixP8miLpK5gaAIFp2yvU37bgGlaZUa5xxid6Xs root@ip-172-31-6-6.us-east-2.compute.internal
[   10.716089] cloud-init[1684]: The key's randomart image is:
[   10.718096] cloud-init[1684]: +--[ED25519 256]--+

[   10.743340] cloud-init[1684]: +----[SHA256]-----+
[   10.745107] cloud-init[1684]: Generating public/private ecdsa key pair.
[   10.747522] cloud-init[1684]: Your identification has been saved in /etc/ssh/ssh_host_ecdsa_key
[   10.750686] cloud-init[1684]: Your public key has been saved in /etc/ssh/ssh_host_ecdsa_key.pub
[   10.753854] cloud-init[1684]: The key fingerprint is:
[   10.755764] cloud-init[1684]: SHA256:AWkuXbJbrdwR0SVYSDIpUZ0clRp287RZzNia8ZApf7U root@ip-172-31-6-6.us-east-2.compute.internal
[   10.793688] cloud-init[1684]: The key's randomart image is:
[   10.800150] cloud-init[1684]: +---[ECDSA 256]---+

[   10.833930] cloud-init[1684]: +----[SHA256]-----+
[   10.892742] clocksource: Switched to clocksource kvm-clock
[   11.355855] cloud-init[1822]: Cloud-init v. 22.2.2 running 'modules:config' at Wed, 29 May 2024 08:27:48 +0000. Up 11.21 seconds.
[   12.022827] cloud-init[1829]: Cloud-init v. 22.2.2 running 'modules:final' at Wed, 29 May 2024 08:27:49 +0000. Up 11.91 seconds.
[   12.145425] cloud-init[1829]: + /opt/aws/bin/cfn-signal --exit-code 0 --stack o11n-eks-int-3794 --resource NodesPrimaryAutoscalingAutoscalinggroup --region us-east-2
[   13.001003] cloud-init[1829]: + echo 'All done'
[   13.003089] cloud-init[1829]: All done
[   13.004695] cloud-init[1829]: + journalctl -u nodeadm-config -u nodeadmrun
[   13.010058] cloud-init[1829]: May 29 08:27:42 localhost systemd[1]: Starting nodeadm-config.service - EKS Nodeadm Config...
[   13.014142] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.7553272,"caller":"init/init.go:49","msg":"Checking user is root.."}
[   13.030643] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.7554145,"caller":"init/init.go:57","msg":"Loading configuration..","configSource":"imds://user-data"}
[   13.040177] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.766488,"caller":"init/init.go:66","msg":"Loaded configuration","config":{"metadata":{"creationTimestamp":null},"spec":{"cluster":{"name":"o11n-eks-int-3794","apiServerEndpoint":"https://61371D8EAC5B135B80A2668214A30315.gr7.us-east-2.eks.amazonaws.com","certificateAuthority":"LS0...","cidr":"10.100.0.0/16"},"containerd":{},"instance":{"localStorage":{}},"kubelet":{"config":{"clusterDNS":["10.100.0.10"],"featureG[2024-05-29T08:27:51.267748]ates":{"DisableKubeletCloudCredentialProviders":true},"registerWithTaints":[{"effect":"NoExecute","key":"node.cilium.io/agent-not-ready","value":"true"},{"effect":"NoSchedule","key":"primary-nodegroup","value":"true"}],"registryPullQPS":100,"serializeImagePulls":false,"shutdownGracePeriod":"30s"}},"featureGates":{"InstanceIdNodeName":true}},"status":{"instance":{},"default":{}}}}
ci-info: no authorized SSH keys fingerprints found for user ec2-user.
[   13.144308] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.766717,"caller":"init/init.go:68","msg":"Enriching configuration.."}
[   13.160123] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.7667294,"caller":"init/init.go:148","msg":"Fetching instance details.."}
[   13.166056] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.7800841,"caller":"init/init.go:161","msg":"Instance details populated","details":{"id":"i-0056120976d3d3f8b","region":"us-east-2","type":"r6a.large","availabilityZone":"us-east-2a","mac":"02:31:fd:40:d5:8d"}}
[   13.179895] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.7801116,"caller":"init/init.go:162","msg":"Fetching default options..."}
<14>May 29 08:27:50 cloud-init: #############################################################
<14>May 29 08:27:50 cloud-init: -----BEGIN SSH HOST KEY FINGERPRINTS-----
<14>May 29 08:27:50 cloud-init: 256 SHA256:AWkuXbJbrdwR0SVYSDIpUZ0clRp287RZzNia8ZApf7U root@ip-172-31-6-6.us-east-2.compute.internal (ECDSA)
<14>May 29 08:27:50 cloud-init: 256 SHA256:3OYLSixP8miLpK5gaAIFp2yvU37bgGlaZUa5xxid6Xs root@ip-172-31-6-6.us-east-2.compute.internal (ED25519)
<14>May 29 08:27:50 cloud-init: -----END SSH HOST KEY FINGERPRINTS-----
<14>May 29 08:27:50 cloud-init: #############################################################
-----BEGIN SSH HOST KEY KEYS-----
ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBH/lPyXduBOdQD4HzyJEN+qPNwAFM9IEQT2awVu7UVyPrc3+Nf9pRN3kuG7YQeHJYrRrF2AluHsa7740EsXxXlE= root@ip-172-31-6-6.us-east-2.compute.internal
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAXf1d8nqHUjszdZeqzZ134kJXGbHGYvNf/ff+Ja5JjH root@ip-172-31-6-6.us-east-2.compute.internal
-----END SSH HOST KEY KEYS-----
[   13.226001] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.7808166,"caller":"init/init.go:170","msg":"Default options populated","defaults":{"sandboxImage":"602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/pause:3.5"}}
[   13.235135] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.7808402,"caller":"init/init.go:73","msg":"Validating configuration.."}
[   13.250068] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.7808487,"caller":"init/init.go:78","msg":"Creating daemon manager.."}
[   13.255947] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.7821825,"caller":"init/init.go:96","msg":"Configuring daemons..."}
[   13.261879] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.782202,"caller":"init/init.go:103","msg":"Configuring daemon...","name":"containerd"}
[   13.268899] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.7822359,"caller":"containerd/config.go:51","msg":"Writing containerd config to file..","path":"/etc/containerd/config.toml"}
[   13.275789] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.790485,"caller":"init/init.go:107","msg":"Configured daemon","name":"containerd"}
[   13.290099] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.790512,"caller":"init/init.go:103","msg":"Configuring daemon...","name":"kubelet"}
[   13.296218] cloud-init[1829]: May 29 08:27:44 localhost nodeadm[1619]: {"level":"info","ts":1716971264.848445,"caller":"kubelet/config.go:300","msg":"Detected kubelet version","version":"v1.28.8"}
[   13.310080] cloud-init[1829]: May 29 08:27:44 localhost nodeadm[1619]: {"level":"info","ts":1716971264.8501525,"caller":"kubelet/config.go:211","msg":"Setup IP for node","ip":"172.31.6.6"}
[   13.316543] cloud-init[1829]: May 29 08:27:44 localhost nodeadm[1619]: {"level":"info","ts":1716971264.8501852,"caller":"kubelet/config.go:247","msg":"Opt-in Instance Id naming strategy"}
[   13.323706] cloud-init[1829]: May 29 08:27:44 localhost nodeadm[1619]: {"level":"info","ts":1716971264.8509867,"caller":"kubelet/config.go:351","msg":"Writing kubelet config to file..","path":"/etc/kubernetes/kubelet/config.json"}
2024/05/29 08:27:50Z: Amazon SSM Agent v3.3.380.0 is running
2024/05/29 08:27:50Z: OsProductName: Amazon Linux
2024/05/29 08:27:50Z: OsVersion: 2023
[   13.340098] cloud-init[1829]: May 29 08:27:45 localhost nodeadm[1619]: {"level":"info","ts":1716971265.0433922,"caller":"init/init.go:107","msg":"Configured daemon","name":"kubelet"}
[   13.346054] cloud-init[1829]: May 29 08:27:45 localhost systemd[1]: nodeadm-config.service: Deactivated successfully.
[   13.349967] cloud-init[1829]: May 29 08:27:45 localhost systemd[1]: Finished nodeadm-config.service - EKS Nodeadm Config.
[   13.353986] cloud-init[1829]: Cloud-init v. 22.2.2 finished at Wed, 29 May 2024 08:27:50 +0000. Datasource DataSourceEc2.  Up 13.23 seconds
Amazon Linux 2023.4.20240513
Kernel 6.1.90-99.173.amzn2023.x86_64 on an x86_64 (-)

I wonder if the first error, Unhandled unknown content-type (application/node.eks.aws) userdata: 'b'---'...', plays a role here.

I wonder if the first error, Unhandled unknown content-type (application/node.eks.aws) userdata: 'b'---'...', plays a role here.

That's cloud-init saying it doesn't have a handler registered for that Content-Type - it's just a warning.

The nodeadm-config unit ran successfully AFAICT, can you grab logs for the nodeadm-run unit?

journalctl -u nodeadm-run

If that looks sane, take a look at kubelet and containerd. If you want to open a case with AWS Support, we can get a more complete snapshot of the instance with this: https://github.com/awslabs/amazon-eks-ami/tree/main/log-collector-script/linux
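Once the journal output is captured, a quick way to surface failures in nodeadm's structured JSON logs is a grep on the level field; this sketch uses sample lines mimicking the log shape shown earlier in this thread:

```shell
# Write sample nodeadm-style JSON log lines (shape taken from the
# journalctl output above) and filter for error-level entries.
cat <<'EOF' > /tmp/nodeadm-sample.log
{"level":"info","ts":1716971262.75,"caller":"init/init.go:57","msg":"Loading configuration.."}
{"level":"error","ts":1716971265.04,"caller":"init/init.go:96","msg":"example failure"}
EOF
grep '"level":"error"' /tmp/nodeadm-sample.log
```

On a node, replace the sample file with `journalctl -u nodeadm-run -o cat`.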

@cartermckinnon I believe I've tracked down the cause, which also explains my random results.

The problem is the necessary change to the aws-auth ConfigMap!

Changing it on an existing cluster is a breaking change! Existing nodes are then blocked from accessing the API because there is a conflict in the username mapping. These errors are visible in the kubelet logs.

- username: system:node:{{EC2PrivateDNSName}}
+ username: system:node:{{SessionName}}

Adding the new mapping as an additional entry instead does not work either; it looks like aws-auth does not support defining the same rolearn twice with different username templates.

    - rolearn: {{ .nodeiamrole }}
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
+    - rolearn: {{ .nodeiamrole }}
+      username: system:node:{{SessionName}}
+      groups:
+        - system:bootstrappers
+        - system:nodes

So there is currently no migration path!

IMHO the problem here is aws-auth, which should support both username templates. Or, if possible, the username should stay as it is and not be changed by this feature.

what about using cluster access entry instead of the aws-auth configMap?

@universam1 Sorry for the delayed response -- you'll have to create a new IAM role for your nodes in order to migrate to this behavior, since you can't have the same AWS principal mapped to two different Kubernetes users. If you're using managed nodegroups, this means creating a new nodegroup with the new node role. If you're using Karpenter, you can just update the role in the EC2NodeClass at the same time you enable this feature gate.
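For the Karpenter path, that's a single field change in the EC2NodeClass; the role name below is hypothetical:

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: MyNewNodeRole   # hypothetical: a fresh role not already mapped in aws-auth
  # ...rest of the node class unchanged
```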

I'll get our docs updated to cover this.

what about using cluster access entry instead of the aws-auth configMap?

+1 for this question. As far as I can see, in "EKS API" auth mode:

  • for type=EC2_LINUX user_name is hardcoded to system:node:{{EC2PrivateDNSName}} (and we need system:node:{{SessionName}})
  • for type=STANDARD one cannot set kubernetes_groups to system:nodes:
Error: creating EKS Access Entry (arn): operation error EKS: CreateAccessEntry, https response error StatusCode: 400, RequestID: 47ad8e95-4d6b-4570-893b-2fdfc3bd9295, InvalidParameterException: The kubernetes group name system:nodes is invalid, it cannot start with system:

Update: per https://docs.aws.amazon.com/eks/latest/userguide/creating-access-entries.html, it actually works fine with type=FARGATE_LINUX - maybe worth adding to the nodeadm docs too.
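In Terraform (which the error above appears to come from), the working variant is then something like this sketch; the cluster name and role reference are placeholders:

```hcl
resource "aws_eks_access_entry" "node" {
  cluster_name  = "my-cluster"            # placeholder
  principal_arn = aws_iam_role.node.arn   # placeholder reference
  type          = "FARGATE_LINUX"         # maps the principal to system:node:{{SessionName}}
}
```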

@sepich yep that will work for now! We're going to add a type specifically for this before this becomes the default behavior