InstanceIdNodeName causes bootstrapping to fail
What happened:
Enabling the feature gate InstanceIdNodeName according to https://awslabs.github.io/amazon-eks-ami/nodeadm/doc/examples/#using-instance-id-as-node-name-experimental causes bootstrapping to fail.
What you expected to happen:
Either bootstrapping to succeed, or an error message in the EC2 system logs.
How to reproduce it (as minimally and precisely as possible):
Add to the NodeConfig:

apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  kubelet:
    config:
      featureGates:
        InstanceIdNodeName: true
Anything else we need to know?:
Tested on both Auto Scaling Group and Karpenter nodes; same behaviour.
The Node IAM role is in place as described.
Environment:
- AWS Region: us-east-2
- Instance Type(s): t3, r6
- Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.28
- AMI Version: amazon-eks-node-al2023-x86_64-standard-1.28-v20240514
The problem was mine - the featureGate should not be placed under the kubelet config but rather under the NodeConfig spec.
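For reference, a minimal sketch of the spec-level placement (this matches the shape of the enriched config in the logs further down; cluster details and other fields omitted):

apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  featureGates:
    InstanceIdNodeName: true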
Actually, the successes / failures appear to be more or less random.
@universam1 please let us know if you run into any other issues with this feature gate, we want to make this the default in AL2025 👍
@cartermckinnon Thanks for checking back - I take that back: it was not my config error; rather, this issue happens intermittently, with roughly a 30% success rate.
I can replicate that: with InstanceIdNodeName: true enabled, bootstrapping fails for ~70% of all tested instances (around 30 instances tried so far). This holds true for both Karpenter and ASG created nodes.
I tried to get some logs via journalctl -u nodeadm-config -u nodeadmrun:
[ 9.620333] cloud-init[1684]: 2024-05-29 08:27:46,748 - __init__.py[WARNING]: Unhandled unknown content-type (application/node.eks.aws) userdata: 'b'---'...'
[ 10.700846] cloud-init[1684]: Generating public/private ed25519 key pair.
[ 10.703543] cloud-init[1684]: Your identification has been saved in /etc/ssh/ssh_host_ed25519_key
[ 10.706692] cloud-init[1684]: Your public key has been saved in /etc/ssh/ssh_host_ed25519_key.pub
[ 10.709952] cloud-init[1684]: The key fingerprint is:
[ 10.712020] cloud-init[1684]: SHA256:3OYLSixP8miLpK5gaAIFp2yvU37bgGlaZUa5xxid6Xs root@ip-172-31-6-6.us-east-2.compute.internal
[ 10.716089] cloud-init[1684]: The key's randomart image is:
[ 10.718096] cloud-init[1684]: +--[ED25519 256]--+
[ 10.743340] cloud-init[1684]: +----[SHA256]-----+
[ 10.745107] cloud-init[1684]: Generating public/private ecdsa key pair.
[ 10.747522] cloud-init[1684]: Your identification has been saved in /etc/ssh/ssh_host_ecdsa_key
[ 10.750686] cloud-init[1684]: Your public key has been saved in /etc/ssh/ssh_host_ecdsa_key.pub
[ 10.753854] cloud-init[1684]: The key fingerprint is:
[ 10.755764] cloud-init[1684]: SHA256:AWkuXbJbrdwR0SVYSDIpUZ0clRp287RZzNia8ZApf7U root@ip-172-31-6-6.us-east-2.compute.internal
[ 10.793688] cloud-init[1684]: The key's randomart image is:
[ 10.800150] cloud-init[1684]: +---[ECDSA 256]---+
[ 10.833930] cloud-init[1684]: +----[SHA256]-----+
[ 10.892742] clocksource: Switched to clocksource kvm-clock
[ 11.355855] cloud-init[1822]: Cloud-init v. 22.2.2 running 'modules:config' at Wed, 29 May 2024 08:27:48 +0000. Up 11.21 seconds.
[ 12.022827] cloud-init[1829]: Cloud-init v. 22.2.2 running 'modules:final' at Wed, 29 May 2024 08:27:49 +0000. Up 11.91 seconds.
[ 12.145425] cloud-init[1829]: + /opt/aws/bin/cfn-signal --exit-code 0 --stack o11n-eks-int-3794 --resource NodesPrimaryAutoscalingAutoscalinggroup --region us-east-2
[ 13.001003] cloud-init[1829]: + echo 'All done'
[ 13.003089] cloud-init[1829]: All done
[ 13.004695] cloud-init[1829]: + journalctl -u nodeadm-config -u nodeadmrun
[ 13.010058] cloud-init[1829]: May 29 08:27:42 localhost systemd[1]: Starting nodeadm-config.service - EKS Nodeadm Config...
[ 13.014142] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.7553272,"caller":"init/init.go:49","msg":"Checking user is root.."}
[ 13.030643] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.7554145,"caller":"init/init.go:57","msg":"Loading configuration..","configSource":"imds://user-data"}
[ 13.040177] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.766488,"caller":"init/init.go:66","msg":"Loaded configuration","config":{"metadata":{"creationTimestamp":null},"spec":{"cluster":{"name":"o11n-eks-int-3794","apiServerEndpoint":"https://61371D8EAC5B135B80A2668214A30315.gr7.us-east-2.eks.amazonaws.com","certificateAuthority":"LS0...","cidr":"10.100.0.0/16"},"containerd":{},"instance":{"localStorage":{}},"kubelet":{"config":{"clusterDNS":["10.100.0.10"],"featureG[2024-05-29T08:27:51.267748]ates":{"DisableKubeletCloudCredentialProviders":true},"registerWithTaints":[{"effect":"NoExecute","key":"node.cilium.io/agent-not-ready","value":"true"},{"effect":"NoSchedule","key":"primary-nodegroup","value":"true"}],"registryPullQPS":100,"serializeImagePulls":false,"shutdownGracePeriod":"30s"}},"featureGates":{"InstanceIdNodeName":true}},"status":{"instance":{},"default":{}}}}
ci-info: no authorized SSH keys fingerprints found for user ec2-user.
[ 13.144308] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.766717,"caller":"init/init.go:68","msg":"Enriching configuration.."}
[ 13.160123] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.7667294,"caller":"init/init.go:148","msg":"Fetching instance details.."}
[ 13.166056] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.7800841,"caller":"init/init.go:161","msg":"Instance details populated","details":{"id":"i-0056120976d3d3f8b","region":"us-east-2","type":"r6a.large","availabilityZone":"us-east-2a","mac":"02:31:fd:40:d5:8d"}}
[ 13.179895] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.7801116,"caller":"init/init.go:162","msg":"Fetching default options..."}
<14>May 29 08:27:50 cloud-init: #############################################################
<14>May 29 08:27:50 cloud-init: -----BEGIN SSH HOST KEY FINGERPRINTS-----
<14>May 29 08:27:50 cloud-init: 256 SHA256:AWkuXbJbrdwR0SVYSDIpUZ0clRp287RZzNia8ZApf7U root@ip-172-31-6-6.us-east-2.compute.internal (ECDSA)
<14>May 29 08:27:50 cloud-init: 256 SHA256:3OYLSixP8miLpK5gaAIFp2yvU37bgGlaZUa5xxid6Xs root@ip-172-31-6-6.us-east-2.compute.internal (ED25519)
<14>May 29 08:27:50 cloud-init: -----END SSH HOST KEY FINGERPRINTS-----
<14>May 29 08:27:50 cloud-init: #############################################################
-----BEGIN SSH HOST KEY KEYS-----
ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBH/lPyXduBOdQD4HzyJEN+qPNwAFM9IEQT2awVu7UVyPrc3+Nf9pRN3kuG7YQeHJYrRrF2AluHsa7740EsXxXlE= root@ip-172-31-6-6.us-east-2.compute.internal
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAXf1d8nqHUjszdZeqzZ134kJXGbHGYvNf/ff+Ja5JjH root@ip-172-31-6-6.us-east-2.compute.internal
-----END SSH HOST KEY KEYS-----
[ 13.226001] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.7808166,"caller":"init/init.go:170","msg":"Default options populated","defaults":{"sandboxImage":"602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/pause:3.5"}}
[ 13.235135] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.7808402,"caller":"init/init.go:73","msg":"Validating configuration.."}
[ 13.250068] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.7808487,"caller":"init/init.go:78","msg":"Creating daemon manager.."}
[ 13.255947] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.7821825,"caller":"init/init.go:96","msg":"Configuring daemons..."}
[ 13.261879] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.782202,"caller":"init/init.go:103","msg":"Configuring daemon...","name":"containerd"}
[ 13.268899] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.7822359,"caller":"containerd/config.go:51","msg":"Writing containerd config to file..","path":"/etc/containerd/config.toml"}
[ 13.275789] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.790485,"caller":"init/init.go:107","msg":"Configured daemon","name":"containerd"}
[ 13.290099] cloud-init[1829]: May 29 08:27:42 localhost nodeadm[1619]: {"level":"info","ts":1716971262.790512,"caller":"init/init.go:103","msg":"Configuring daemon...","name":"kubelet"}
[ 13.296218] cloud-init[1829]: May 29 08:27:44 localhost nodeadm[1619]: {"level":"info","ts":1716971264.848445,"caller":"kubelet/config.go:300","msg":"Detected kubelet version","version":"v1.28.8"}
[ 13.310080] cloud-init[1829]: May 29 08:27:44 localhost nodeadm[1619]: {"level":"info","ts":1716971264.8501525,"caller":"kubelet/config.go:211","msg":"Setup IP for node","ip":"172.31.6.6"}
[ 13.316543] cloud-init[1829]: May 29 08:27:44 localhost nodeadm[1619]: {"level":"info","ts":1716971264.8501852,"caller":"kubelet/config.go:247","msg":"Opt-in Instance Id naming strategy"}
[ 13.323706] cloud-init[1829]: May 29 08:27:44 localhost nodeadm[1619]: {"level":"info","ts":1716971264.8509867,"caller":"kubelet/config.go:351","msg":"Writing kubelet config to file..","path":"/etc/kubernetes/kubelet/config.json"}
2024/05/29 08:27:50Z: Amazon SSM Agent v3.3.380.0 is running
2024/05/29 08:27:50Z: OsProductName: Amazon Linux
2024/05/29 08:27:50Z: OsVersion: 2023
[ 13.340098] cloud-init[1829]: May 29 08:27:45 localhost nodeadm[1619]: {"level":"info","ts":1716971265.0433922,"caller":"init/init.go:107","msg":"Configured daemon","name":"kubelet"}
[ 13.346054] cloud-init[1829]: May 29 08:27:45 localhost systemd[1]: nodeadm-config.service: Deactivated successfully.
[ 13.349967] cloud-init[1829]: May 29 08:27:45 localhost systemd[1]: Finished nodeadm-config.service - EKS Nodeadm Config.
[ 13.353986] cloud-init[1829]: Cloud-init v. 22.2.2 finished at Wed, 29 May 2024 08:27:50 +0000. Datasource DataSourceEc2. Up 13.23 seconds
Amazon Linux 2023.4.20240513
Kernel 6.1.90-99.173.amzn2023.x86_64 on an x86_64 (-)
Wonder if the first error Unhandled unknown content-type (application/node.eks.aws) userdata: 'b'---'...' plays a role here.
> Wonder if the first error Unhandled unknown content-type (application/node.eks.aws) userdata: 'b'---'...' plays a role here
That's cloud-init saying it doesn't have a handler registered for the Content-Type; it's just a warning.
The nodeadm-config unit ran successfully AFAICT, can you grab logs for the nodeadm-run unit?

journalctl -u nodeadm-run

If that looks sane, take a look at kubelet and containerd. If you want to open a case with AWS Support, we can get a more complete snapshot of the instance with this: https://github.com/awslabs/amazon-eks-ami/tree/main/log-collector-script/linux
@cartermckinnon I believe I've tracked down the cause, which also explains my random results.
The problem is the necessary change in the configmap/aws-auth! Changing it on an existing cluster is a breaking change: the existing nodes are then blocked from accessing the API because there is a conflict in the username. These errors are visible in the kubelet logs.
- username: system:node:{{EC2PrivateDNSName}}
+ username: system:node:{{SessionName}}
When adding that section instead, this is also not working; it looks like aws-auth does not support defining the rolearn twice with different username templates.
- rolearn: {{ .nodeiamrole }}
  username: system:node:{{EC2PrivateDNSName}}
  groups:
    - system:bootstrappers
    - system:nodes
+ - rolearn: {{ .nodeiamrole }}
+   username: system:node:{{SessionName}}
+   groups:
+     - system:bootstrappers
+     - system:nodes
So we have no migration path currently!
IMHO the problem here is aws-auth, which should support both types of usernames. Or the username, if possible, should stay as it is and not be changed by this feature.
what about using cluster access entry instead of the aws-auth configMap?
@universam1 Sorry for the delayed response -- you'll have to create a new IAM role for your nodes in order to migrate to this behavior, since you can't have the same AWS principal mapped to 2 different Kubernetes users. If you're using managed nodegroups, this means creating a new nodegroup with the new node role. If you're using Karpenter, you can just update the role in the EC2NodeClass at the same time you enable this feature gate.
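For the Karpenter route, a rough sketch of the idea (assuming the karpenter.k8s.aws/v1beta1 EC2NodeClass schema; the role name, discovery tags, and cluster name below are placeholders rather than values from this issue):

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: al2023-instance-id-names
spec:
  amiFamily: AL2023
  role: new-al2023-node-role                # newly created node IAM role for the migration
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster  # placeholder selector
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster  # placeholder selector
  userData: |
    apiVersion: node.eks.aws/v1alpha1
    kind: NodeConfig
    spec:
      featureGates:
        InstanceIdNodeName: true

The point is just that the new role and the feature gate are switched together, so new nodes come up with the SessionName-style username from the start.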
I'll get our docs updated to cover this.
> what about using cluster access entry instead of the aws-auth configMap?
+1 for this question, as I can see in "EKS API" auth mode:
- for type=EC2_LINUX, user_name is hardcoded to system:node:{{EC2PrivateDNSName}} (and we need system:node:{{SessionName}})
- for type=STANDARD, one cannot set kubernetes_groups to system:nodes:
Error: creating EKS Access Entry (arn): operation error EKS: CreateAccessEntry, https response error StatusCode: 400, RequestID: 47ad8e95-4d6b-4570-893b-2fdfc3bd9295, InvalidParameterException: The kubernetes group name system:nodes is invalid, it cannot start with system:
Update: per https://docs.aws.amazon.com/eks/latest/userguide/creating-access-entries.html, it actually works fine with type=FARGATE_LINUX; maybe worth adding to the nodeadm docs too.
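For illustration, a minimal CloudFormation sketch of that workaround (assuming the AWS::EKS::AccessEntry resource; the cluster name and node role ARN are placeholders):

Resources:
  NodeRoleAccessEntry:
    Type: AWS::EKS::AccessEntry
    Properties:
      ClusterName: my-cluster                                     # placeholder
      PrincipalArn: arn:aws:iam::111122223333:role/my-node-role   # placeholder node role
      Type: FARGATE_LINUX   # the type noted above as working for now; username/groups are managed by EKS for this type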
@sepich yep that will work for now! We're going to add a type specifically for this before this becomes the default behavior