Nodeadm-run service fails if containerd is not fully initialized and running
ajvn opened this issue · 9 comments
What happened:
We are building our own AMIs with preloaded critical container images. This helps us speed up node getting into a ready state, and we have those images available on the node itself in case registries are down.
However, this slows down the startup of containerd a little, and if containerd is not fully up and running before nodeadm tries to pull the pause image, the nodeadm-run service fails and the node won't join the cluster; the instance stays up and running and has to be removed manually.
The other option is logging into the node and restarting the nodeadm-run service. After that the node joins the cluster, but this is obviously not a scale-friendly solution.
This happens randomly; so far it affects on average every fourth node joining the cluster.
What you expected to happen:
It looks like nodeadm tries to pull the pause image 3 times and then exits with an error (see the journalctl output below).
It would be good if we could adjust how many times it should retry and/or how often, via a configuration option or a flag (preferably a configuration option, so we don't have to adjust the systemd unit file).
Potentially we could try adding containerd.service to the After= section of the systemd unit file, but I don't know whether this helps, as we need containerd to be fully up and running, not merely active according to systemd.
[Unit]
Description=EKS Nodeadm Run
Documentation=https://github.com/awslabs/amazon-eks-ami
# start after cloud-init, in order to pickup changes the
# user may have applied via cloud-init scripts
After=nodeadm-config.service cloud-final.service <= adding containerd.service here
Requires=nodeadm-config.service
[Service]
Type=oneshot
ExecStart=/usr/bin/nodeadm init --skip config
[Install]
WantedBy=multi-user.target
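If ordering alone turns out to be insufficient (systemd only waits for containerd.service to be "active", not for the daemon to be serving requests), a drop-in could additionally poll for the socket before nodeadm runs. This is a hypothetical sketch, not a tested fix; the socket path is taken from the error logs below, and even this check may not be enough, since the logs also show "server is not initialized yet" after the socket appears:

```ini
# /etc/systemd/system/nodeadm-run.service.d/10-wait-containerd.conf
# Hypothetical drop-in: order after containerd and poll for its socket
# (up to 30 seconds) before nodeadm init is allowed to start.
[Unit]
After=containerd.service

[Service]
ExecStartPre=/usr/bin/bash -c 'for i in $(seq 1 30); do [ -S /run/containerd/containerd.sock ] && exit 0; sleep 1; done; exit 1'
```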
How to reproduce it (as minimally and precisely as possible):
Have containerd not be fully ready before the nodeadm-run service executes.
Anything else we need to know?:
Here's some additional information which helped me during the investigation:
$ sudo journalctl --no-pager -b -u nodeadm-run
...
Aug 10 04:22:54 nodeadm[3026]: {"level":"info","ts":1723263774.9953516,"caller":"containerd/sandbox.go:48","msg":"Pulling sandbox image..","image":"602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5"}
Aug 10 04:22:54 nodeadm[3026]: E0810 04:22:54.996931 3026 remote_image.go:135] PullImage "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5" from image service failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: no such file or directory"
Aug 10 04:22:56 nodeadm[3026]: {"level":"info","ts":1723263776.997649,"caller":"containerd/sandbox.go:48","msg":"Pulling sandbox image..","image":"602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5"}
Aug 10 04:22:56 nodeadm[3026]: E0810 04:22:56.997752 3026 remote_image.go:135] PullImage "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5" from image service failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: no such file or directory"
Aug 10 04:23:00 nodeadm[3026]: {"level":"info","ts":1723263780.998839,"caller":"containerd/sandbox.go:48","msg":"Pulling sandbox image..","image":"602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5"}
Aug 10 04:23:01 nodeadm[3026]: E0810 04:23:01.003106 3026 remote_image.go:135] PullImage "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5" from image service failed: rpc error: code = Unknown desc = server is not initialized yet
Aug 10 04:23:09 nodeadm[3026]: {"level":"fatal","ts":1723263789.010438,"caller":"nodeadm/main.go:36","msg":"Command failed","error":"rpc error: code = Unknown desc = server is not initialized yet","stacktrace":"main.main\n\t/workdir/cmd/nodeadm/main.go:36\nruntime.main\n\t/root/sdk/go1.21.9/src/runtime/proc.go:267"}
Aug 10 04:23:09 systemd[1]: nodeadm-run.service: Main process exited, code=exited, status=1/FAILURE
Aug 10 04:23:09 systemd[1]: nodeadm-run.service: Failed with result 'exit-code'.
Aug 10 04:23:09 systemd[1]: Failed to start nodeadm-run.service - EKS Nodeadm Run.
This is mainly relevant for the times at which the containerd and nodeadm-run services started:
● containerd.service - containerd container runtime
Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; preset: disabled)
Drop-In: /etc/systemd/system/containerd.service.d
└─00-runtime-slice.conf
Active: active (running) since Sat 2024-08-10 04:23:01 UTC; 2h 14min ago
Docs: https://containerd.io
Process: 3050 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Main PID: 3058 (containerd)
Tasks: 89
Memory: 29.8M
CPU: 4.262s
CGroup: /runtime.slice/containerd.service
└─3058 /usr/bin/containerd
Warning: some journal files were not opened due to insufficient permissions.
---
× nodeadm-run.service - EKS Nodeadm Run
Loaded: loaded (/etc/systemd/system/nodeadm-run.service; enabled; preset: disabled)
Active: failed (Result: exit-code) since Sat 2024-08-10 04:23:09 UTC; 2h 14min ago
Docs: https://github.com/awslabs/amazon-eks-ami
Process: 3026 ExecStart=/usr/bin/nodeadm init --skip config (code=exited, status=1/FAILURE)
Main PID: 3026 (code=exited, status=1/FAILURE)
CPU: 107ms
Warning: some journal files were not opened due to insufficient permissions.
Environment:
I don't believe it's relevant in this case, but I'm happy to provide it if requested.
The workaround for now is replacing the default systemd unit file with the following, added to the node launch template:
...
--BOUNDARY
Content-Type: text/x-shellscript;
#!/usr/bin/env bash
cat > /etc/systemd/system/nodeadm-run.service << EOF
[Unit]
Description=EKS Nodeadm Run
Documentation=https://github.com/awslabs/amazon-eks-ami
# start after cloud-init, in order to pickup changes the
# user may have applied via cloud-init scripts
After=nodeadm-config.service cloud-final.service containerd.service
Requires=nodeadm-config.service
[Service]
Type=oneshot
ExecStart=/usr/bin/nodeadm init --skip config
RestartSec=5s
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
--BOUNDARY--
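Since the logs show "server is not initialized yet" even after the socket exists, a readiness check should probe the API rather than just the socket file. A small generic helper like the following could be used for that; wait_for is a hypothetical name, not part of the AMI, and the ctr invocation at the end is only an illustration:

```shell
#!/usr/bin/env bash
# wait_for: retry a command until it succeeds or the attempt budget runs out.
# Usage: wait_for <max_attempts> <delay_seconds> <command...>
wait_for() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Hypothetical usage: wait up to ~60s for containerd to answer an API call,
# which is a stronger signal than the socket merely existing.
# wait_for 30 2 ctr --address /run/containerd/containerd.sock version
```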
I haven't seen this occur, but the bug sounds legit to me. I think we'll want to handle this in nodeadm instead of with systemd unit dependencies; we can just wait until our CRI client can connect to the socket. I'll get a PR together this week 👍
Thank you, let me know when there's an AMI available to test and I'll give it a go.
#1965 should help with containerd not being ready before initiating the sandbox image pulls 👍
Just so I understand properly, this is not AMI-version dependent, and it will automatically pull the new nodeadm version when a new node joins the cluster?
@ajvn nodeadm gets built with the AMI, so you'll get the update when the next AMI release happens 👍
Gotcha, thanks 👍
I'll let you know how it goes once there's new AMI released and I take it into use.