pre/post_bootstrap_user_data doesn't work anymore with AL2023
rgarrigue opened this issue · 9 comments
Description
I switched my EKS managed node groups to `ami_type = "AL2023_x86_64_STANDARD"` (previously `AL2_x86_64`). My user data then stopped working, and I can see this `Unhandled unknown content-type` warning in `journalctl -u cloud-init.service`:
```
Oct 21 11:29:03 localhost cloud-init[2783]: ci-info: +-------+-------------+---------+-----------+-------+
Oct 21 11:29:03 localhost cloud-init[2783]: ci-info: | Route | Destination | Gateway | Interface | Flags |
Oct 21 11:29:03 localhost cloud-init[2783]: ci-info: +-------+-------------+---------+-----------+-------+
Oct 21 11:29:03 localhost cloud-init[2783]: ci-info: |   0   |  fe80::/64  |    ::   |  enp39s0  |   U   |
Oct 21 11:29:03 localhost cloud-init[2783]: ci-info: |   2   |    local    |    ::   |  enp39s0  |   U   |
Oct 21 11:29:03 localhost cloud-init[2783]: ci-info: |   3   |  multicast  |    ::   |  enp39s0  |   U   |
Oct 21 11:29:03 localhost cloud-init[2783]: ci-info: +-------+-------------+---------+-----------+-------+
Oct 21 11:29:03 ip-10-20-10-69.eu-north-1.compute.internal cloud-init[2783]: 2024-10-21 11:29:03,539 - __init__.py[WARNING]: Unhandled unknown content-type (application/node.eks.aws) userdata: 'b'---'...'
Oct 21 11:29:04 ip-10-20-10-69.eu-north-1.compute.internal cloud-init[2783]: Generating public/private ed25519 key pair.
Oct 21 11:29:04 ip-10-20-10-69.eu-north-1.compute.internal cloud-init[2783]: Your identification has been saved in /etc/ssh/ssh_host_ed25519_key
Oct 21 11:29:04 ip-10-20-10-69.eu-north-1.compute.internal cloud-init[2783]: Your public key has been saved in /etc/ssh/ssh_host_ed25519_key.pub
```
And comparing with AL2 worker nodes, the `part-001` & co. script files are absent, i.e. the `scripts/` folder is empty. On an AL2 node it contains:
```
/var/lib/cloud/instances/i-0faebac7b8b11778c/scripts
/var/lib/cloud/instances/i-0faebac7b8b11778c/scripts/part-001
/var/lib/cloud/instances/i-0faebac7b8b11778c/scripts/part-002
```
- [x] ✋ I have searched the open/closed issues and my issue is not listed.
Versions
- Module version [Required]: 20.24.2
- Terraform version:

  ```
  Terraform v1.6.6
  on linux_amd64
  - provider registry.terraform.io/hashicorp/aws v5.72.1
  - provider registry.terraform.io/hashicorp/cloudinit v2.3.5
  - provider registry.terraform.io/hashicorp/kubernetes v2.21.1
  - provider registry.terraform.io/hashicorp/null v3.2.3
  - provider registry.terraform.io/hashicorp/time v0.12.1
  - provider registry.terraform.io/hashicorp/tls v4.0.6
  ```

- Provider version(s) (`terraform providers -version`): same output as above (issue template to be updated?)
Reproduction Code [Required]
Steps to reproduce the behavior:
```hcl
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "20.26.0"

  cluster_name    = "test"
  cluster_version = "1.31"

  # Network
  vpc_id                          = "vpc-0052643b5ded2cce4"
  subnet_ids                      = ["subnet-0304ee0b265a7d4a3", "subnet-0ee42ef7b5d2d5a71"]
  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true

  # Addons
  cluster_addons = {
    coredns = {
      most_recent = true
    }
    kube-proxy = {
      most_recent = true
    }
    vpc-cni = {
      most_recent    = true
      before_compute = true
    }
  }

  eks_managed_node_group_defaults = {
    ami_type             = "AL2023_x86_64_STANDARD"
    instance_types       = ["c5.large"]
    launch_template_name = "test"

    attach_cluster_primary_security_group = true

    iam_role_additional_policies = {
      "ssm" : "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore",
    }

    post_bootstrap_user_data = <<-EOT
      echo
      echo "Add ops' shared public key to '$(whoami)' user SSH's authorized_keys"
      echo

      groupadd ops
      useradd -s /bin/bash -g ops ops

      mkdir -p /home/ops/.ssh
      chmod 0700 /home/ops/.ssh
      echo "ssh-ed25519 AAAAC3Nz______________dQpkJ5 ops shared key" | tee /home/ops/.ssh/authorized_keys
      chmod 0444 /home/ops/.ssh/authorized_keys
      chown -R ops: /home/ops

      echo "ops ALL=(ALL) NOPASSWD: ALL" | tee /etc/sudoers.d/ops
      chmod 0400 /etc/sudoers.d/ops
    EOT
  }

  eks_managed_node_groups = {
    default = {
      name = "test"

      min_size     = 1
      max_size     = 1
      desired_size = 1

      subnet_ids = ["subnet-0304ee0b265a7d4a3", "subnet-0ee42ef7b5d2d5a71"]

      block_device_mappings = {
        xvda = {
          device_name = "/dev/xvda"
          ebs = {
            volume_size           = 100
            volume_type           = "gp3"
            iops                  = 200
            delete_on_termination = true
          }
        }
      }
    }
  }
}
```
- No workspace
- Local cache cleared
- Steps: replace the `ami_type` value with `AL2023_x86_64_STANDARD`
Expected behavior
My user data to be executed, hence the `ops` user created, so that with this `~/.ssh/config`:

```
host i-* mi-*
  ProxyCommand sh -c "aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"
  StrictHostKeyChecking no
  User ops
  IdentityFile ops
```
I can:

```
❯ ssh i-08042aefcc8bb7624
Updates Information Summary: available
    1 Security notice(s)
        1 Medium Security notice(s)
   ,     #_
   ~\_  ####_        Amazon Linux 2023
  ~~  \_#####\
  ~~     \###|
  ~~       \#/ ___   https://aws.amazon.com/linux/amazon-linux-2023
   ~~       V~' '->
    ~~~         /
      ~~._.   _/
         _/ _/
       _/m/'
Last login: Tue Oct 22 07:42:33 2024 from 127.0.0.1
```
Actual behavior
```
❯ ssh i-0485fe90afd97a39e
Warning: Permanently added 'i-0485fe90afd97a39e' (ED25519) to the list of known hosts.
Received disconnect from UNKNOWN port 65535:2: Too many authentication failures
Disconnected from UNKNOWN port 65535
```
I have to open the AWS console, go to the EC2 instance, connect via SSM, `sudo`, and run my user data by hand; only then can I SSH in as intended.
Edit: fixed the TF snippet; also tried the latest module version, 20.26.0, with no improvement.
I can confirm this with module version 20.26, but also just from diving into the module code. The user data for AL2023 completely ignores any values in the `pre_bootstrap_user_data` and `post_bootstrap_user_data` variables; the template file makes no reference to either. Instead, completely new variables with a new expected syntax were introduced: `cloudinit_pre_nodeadm` and `cloudinit_post_nodeadm`. I don't see these variables or the new behavior documented anywhere.
Is the intent to stop supporting the user data variables in this module, or was it an oversight to leave those variables out of the AL2023 template file?
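For anyone hitting the same wall: a workaround that appears viable is to route the same script through the new variables. A minimal sketch, assuming the v20 module's `cloudinit_post_nodeadm` accepts a list of cloud-init parts with `content` and `content_type` keys (values here mirror the repro above and are illustrative, not verified against every module version):

```hcl
eks_managed_node_group_defaults = {
  ami_type = "AL2023_x86_64_STANDARD"

  # Instead of post_bootstrap_user_data, append a shell-script part to the
  # multipart MIME user data that nodeadm/cloud-init consume on AL2023.
  cloudinit_post_nodeadm = [{
    content_type = "text/x-shellscript; charset=\"us-ascii\""
    content      = <<-EOT
      #!/usr/bin/env bash
      groupadd ops
      useradd -s /bin/bash -g ops ops
      # ... remainder of the post-bootstrap script from the repro above
    EOT
  }]
}
```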
AL2023 uses a different form of user data than AL2; see `terraform-aws-eks/tests/user-data/main.tf`, lines 108 to 210 at commit `97a08c8`.
@bryantbiggs Yes, and Windows also has a different form of user data than AL2, yet it uses the same module variables to build its templates. Are the concepts really that different between AL2 and AL2023? AL2023 seems to work the same way AL2 does when specifying an AMI in the launch template; the only difference is an additional `NodeConfig` section in its multipart MIME.
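For context, the user data an AL2023 node receives looks roughly like the following multipart MIME message, with the `NodeConfig` in a part of type `application/node.eks.aws` that `nodeadm` (not cloud-init) consumes. This is a hand-written sketch; the cluster name, endpoint, CA, and CIDR are placeholders:

```
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="BOUNDARY"

--BOUNDARY
Content-Type: application/node.eks.aws

---
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: test
    apiServerEndpoint: https://EXAMPLE.gr7.eu-north-1.eks.amazonaws.com
    certificateAuthority: BASE64_ENCODED_CA
    cidr: 172.20.0.0/16

--BOUNDARY
Content-Type: text/x-shellscript; charset="us-ascii"

#!/usr/bin/env bash
echo "additional user data script parts go here"
--BOUNDARY--
```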
I think this is just a matter of broken docs and expectations, not broken code. The logic for shimming a user data script into a multipart MIME message was already in this module, and it used the same user data variables employed in other scenarios. So while the new variables work well and allow flexibility in building a custom multipart MIME message, it is a bit unexpected to have new variables, especially given that the user data README still suggests using the older ones.
I'm happy to suggest some README updates, though I'm not sure I fully understand the conditionals in the user data module, and I may have misunderstood something in the new AL2023 format anyway; if so, sorry. In any case, thanks for the time spent on this.
An updated README would suit me fine; my current problem is that I don't know how to get started.