awslabs/amazon-eks-ami

Nodeadm-run service fails if containerd is not fully initialized and running

ajvn opened this issue · 9 comments

ajvn commented

What happened:
We are building our own AMIs with critical container images preloaded. This speeds up nodes getting into a ready state, and the images are available on the node itself in case registries are down.

However, this slows down containerd startup slightly, and if containerd is not fully up and running before nodeadm tries to pull the pause image, the nodeadm-run service fails and the node never joins the cluster, even though the instance keeps running and has to be removed manually.
The other option is logging into the node and restarting the nodeadm-run service; the node then joins the cluster, but this is obviously not a scale-friendly solution.

This happens randomly; so far it affects, on average, every 4th node joining the cluster.

What you expected to happen:
It looks like nodeadm tries to pull the pause image 3 times and then exits with an error (see the journal output below).

It would be good if we could adjust how many times it retries and/or how often it retries via a configuration option or a flag (preferably a configuration option, so we don't have to adjust the systemd unit file).
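For illustration, a minimal sketch of the kind of configurable retry this is asking for (in Go, since nodeadm is written in Go). The names pullSandboxImage, maxAttempts, and interval are hypothetical and not nodeadm's actual API; the point is that both values would come from configuration instead of being hard-coded:

package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// pullSandboxImage is a stand-in for nodeadm's sandbox image pull.
// Here it always fails, just to exercise the retry loop.
func pullSandboxImage(image string) error {
	return errors.New("server is not initialized yet")
}

// pullWithRetry retries the pull up to maxAttempts times, sleeping
// interval between attempts. Both values are meant to be configurable
// rather than fixed at 3 attempts.
func pullWithRetry(image string, maxAttempts int, interval time.Duration) error {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		log.Printf("Pulling sandbox image.. attempt=%d image=%s", attempt, image)
		if lastErr = pullSandboxImage(image); lastErr == nil {
			return nil
		}
		log.Printf("pull failed: %v", lastErr)
		time.Sleep(interval)
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	err := pullWithRetry("602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5", 10, 5*time.Second)
	if err != nil {
		log.Fatal(err)
	}
}

With something like this, operators could bump the attempt count and interval for AMIs where containerd takes longer to come up.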

Potentially we could try adding containerd.service to the After= section of the systemd unit file, but I don't know whether that helps, since we need containerd to be fully up and running, not just active according to systemd.

[Unit]
Description=EKS Nodeadm Run
Documentation=https://github.com/awslabs/amazon-eks-ami
# start after cloud-init, in order to pickup changes the
# user may have applied via cloud-init scripts
After=nodeadm-config.service cloud-final.service <= adding containerd.service here
Requires=nodeadm-config.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nodeadm init --skip config

[Install]
WantedBy=multi-user.target

How to reproduce it (as minimally and precisely as possible):
Have containerd not be fully ready before nodeadm-run service executes.

Anything else we need to know?:
Here's some additional information which helped me during the investigation:

$ sudo journalctl --no-pager -b -u nodeadm-run
...
Aug 10 04:22:54 nodeadm[3026]: {"level":"info","ts":1723263774.9953516,"caller":"containerd/sandbox.go:48","msg":"Pulling sandbox image..","image":"602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5"}
Aug 10 04:22:54 nodeadm[3026]: E0810 04:22:54.996931    3026 remote_image.go:135] PullImage "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5" from image service failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: no such file or directory"
Aug 10 04:22:56 nodeadm[3026]: {"level":"info","ts":1723263776.997649,"caller":"containerd/sandbox.go:48","msg":"Pulling sandbox image..","image":"602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5"}
Aug 10 04:22:56 nodeadm[3026]: E0810 04:22:56.997752    3026 remote_image.go:135] PullImage "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5" from image service failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: no such file or directory"
Aug 10 04:23:00 nodeadm[3026]: {"level":"info","ts":1723263780.998839,"caller":"containerd/sandbox.go:48","msg":"Pulling sandbox image..","image":"602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5"}
Aug 10 04:23:01 nodeadm[3026]: E0810 04:23:01.003106    3026 remote_image.go:135] PullImage "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5" from image service failed: rpc error: code = Unknown desc = server is not initialized yet
Aug 10 04:23:09 nodeadm[3026]: {"level":"fatal","ts":1723263789.010438,"caller":"nodeadm/main.go:36","msg":"Command failed","error":"rpc error: code = Unknown desc = server is not initialized yet","stacktrace":"main.main\n\t/workdir/cmd/nodeadm/main.go:36\nruntime.main\n\t/root/sdk/go1.21.9/src/runtime/proc.go:267"}
Aug 10 04:23:09 systemd[1]: nodeadm-run.service: Main process exited, code=exited, status=1/FAILURE
Aug 10 04:23:09 systemd[1]: nodeadm-run.service: Failed with result 'exit-code'.
Aug 10 04:23:09 systemd[1]: Failed to start nodeadm-run.service - EKS Nodeadm Run.

The relevant part is the timing of when the containerd and nodeadm-run services started:

● containerd.service - containerd container runtime
     Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; preset: disabled)
    Drop-In: /etc/systemd/system/containerd.service.d
             └─00-runtime-slice.conf
     Active: active (running) since Sat 2024-08-10 04:23:01 UTC; 2h 14min ago
       Docs: https://containerd.io
    Process: 3050 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 3058 (containerd)
      Tasks: 89
     Memory: 29.8M
        CPU: 4.262s
     CGroup: /runtime.slice/containerd.service
             └─3058 /usr/bin/containerd

Warning: some journal files were not opened due to insufficient permissions.
---
× nodeadm-run.service - EKS Nodeadm Run
     Loaded: loaded (/etc/systemd/system/nodeadm-run.service; enabled; preset: disabled)
     Active: failed (Result: exit-code) since Sat 2024-08-10 04:23:09 UTC; 2h 14min ago
       Docs: https://github.com/awslabs/amazon-eks-ami
    Process: 3026 ExecStart=/usr/bin/nodeadm init --skip config (code=exited, status=1/FAILURE)
   Main PID: 3026 (code=exited, status=1/FAILURE)
        CPU: 107ms

Warning: some journal files were not opened due to insufficient permissions.

Environment:
I don't believe it's relevant in this case, but if requested I am happy to oblige.

ajvn commented

The workaround for now is replacing the default systemd unit file with the one below, added to the node launch template:

...
--BOUNDARY
Content-Type: text/x-shellscript;

#!/usr/bin/env bash
cat > /etc/systemd/system/nodeadm-run.service << EOF
[Unit]
Description=EKS Nodeadm Run
Documentation=https://github.com/awslabs/amazon-eks-ami
# start after cloud-init, in order to pickup changes the
# user may have applied via cloud-init scripts
After=nodeadm-config.service cloud-final.service containerd.service
Requires=nodeadm-config.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nodeadm init --skip config
RestartSec=5s
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload

--BOUNDARY--

I haven't seen this occur, but the bug sounds legit to me. I think we'll want to handle this in nodeadm instead of with systemd unit dependencies; we can just wait until our CRI client can connect to the socket. I'll get a PR together this week 👍
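For reference, a minimal, stdlib-only sketch of what waiting for the containerd socket could look like; waitForContainerdSocket is a hypothetical name and this is not the actual PR, just an illustration of the approach. The socket path is the one from the journal output above:

package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"time"
)

// waitForContainerdSocket blocks until a unix-domain connection to the
// containerd socket succeeds, retrying every interval until the context
// expires.
func waitForContainerdSocket(ctx context.Context, socketPath string, interval time.Duration) error {
	for {
		conn, err := net.DialTimeout("unix", socketPath, time.Second)
		if err == nil {
			conn.Close()
			return nil
		}
		log.Printf("containerd not ready yet: %v", err)
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for %s: %w", socketPath, ctx.Err())
		case <-time.After(interval):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	if err := waitForContainerdSocket(ctx, "/run/containerd/containerd.sock", 2*time.Second); err != nil {
		log.Fatal(err)
	}
	log.Println("containerd socket is accepting connections; safe to pull the sandbox image")
}

A successful dial only shows the socket accepts connections; the "server is not initialized yet" error in the logs suggests a real readiness check would also issue a CRI call before pulling.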

ajvn commented

Thank you, let me know when there's an AMI available to test and I'll give it a go.

#1965 should help with containerd not being ready before initiating the sandbox image pulls 👍

ajvn commented

Just so I understand properly: this is not AMI-version dependent, and it will automatically pull the new nodeadm version when a new node joins the cluster?

@ajvn nodeadm gets built with the AMI, so you'll get updates when the next AMI release happens 👍

ajvn commented

Gotcha, thanks 👍
I'll let you know how it goes once there's a new AMI released and I've taken it into use.