Pods can't run due to failures pulling `pause` image; `pause` image is being incorrectly garbage collected
ForbiddenEra opened this issue · 7 comments
What happened:
Nodes stop being able to create new pods. The error is: `Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5" failed to pull and unpack image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": unexpected status from HEAD request to https://602401143452.dkr.ecr.ca-central-1.amazonaws.com/v2/eks/pause/manifests/3.5: 401 Unauthorized`
What you expected to happen:
Pods should work.
How to reproduce it (as minimally and precisely as possible):
Try to deploy a pod on EKS. Until just now I had no idea how to reproduce it exactly, but I have at least a slight idea now. After diving into this again because nodes started failing to deploy pods again, I started Googling and, once more, wasn't able to find much, but I saw #1425 again and re-read it.
That issue oddly makes sense, but only after diving deeper to see whether what it describes (images getting removed) could actually be happening here. It shouldn't be, since I'm using off-the-shelf AMIs and never access the nodes directly, but I wanted to check.
While trying to find a reference on where kubelet/Kubernetes caches images (I didn't know offhand), I found this in the Kubernetes docs:
> Garbage collection for unused container images
> FEATURE STATE: Kubernetes v1.29 [alpha]
> As an alpha feature, you can specify the maximum time a local image can be unused for, regardless of disk usage. This is a kubelet setting that you configure for each node.
> To configure the setting, enable the `ImageMaximumGCAge` feature gate for the kubelet, and also set a value for the `ImageMaximumGCAge` field in the kubelet configuration file.
> The value is specified as a Kubernetes duration; for example, you can set the configuration field to `3d12h`, which means 3 days and 12 hours.
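For context, a minimal sketch of what that setting looks like on a node; the field and gate names come from the upstream KubeletConfiguration docs, while the config file path below is an assumption and varies by AMI:

```sh
# Check whether the node's kubelet config enables the alpha image-age GC.
# /etc/kubernetes/kubelet/config.json is an assumed path; it differs per AMI.
sudo grep -iE 'featureGates|imageMaximumGCAge' /etc/kubernetes/kubelet/config.json

# If it were enabled, the KubeletConfiguration would carry something like:
#   "featureGates": { "ImageMaximumGCAge": true },
#   "imageMaximumGCAge": "3d12h"
```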
I don't know for sure whether this is related, but it feels like a possibility. The last time the issue popped up was after the weekend, a stretch where our nodes could easily go 3 days and 12 hours without any new pods.
Anything else we need to know?:
This has only been happening since upgrading to 1.29 on EKS, and it has happened on both the Amazon Linux AMIs (IIRC both AL2 and AL2023) and the Ubuntu EKS AMIs, on both self-managed and EKS-managed node groups. It seems to affect nodes randomly; it's never an entire node group or anything, and there's nothing that correlates the affected nodes.
AL2023_x86_64_STANDARD-1.29.0-20240307
Environment:
- AWS Region: ca-central-1
- Instance Type(s): t3.medium, t3.large
- EKS Platform version: eks.3
- Kubernetes version: 1.29
- AMI Version: For AL2/AL2023 (EKS-managed nodes) I can't say the exact AMIs for certain, as I don't specifically log them (we use the Terraform AWS EKS module); it would be whatever latest versions were available whenever I applied our Terraform configuration over the last 30-45 days, and I can't easily find the actual AMI IDs in the console, but my EKS-managed group is currently on AL2023_x86_64_STANDARD-1.29.0-20240307. For Ubuntu (self-managed nodes): 20240229, 20240301, 20240318, 20240322; I'm not sure of the exact AMIs except for 20240322, which is ami-0256abce7a68aa374.
- Kernel (e.g. `uname -a`): 6.1.79-99.164.amzn2023.x86_64 for the AL2023 nodes, 6.5.0-1015-aws for Ubuntu
- Release information (run `cat /etc/eks/release` on a node):
For AL2023 nodes:
BASE_AMI_ID="ami-04d16ea6ebbe475e3"
BUILD_TIME="Thu Mar 7 07:30:33 UTC 2024"
BUILD_KERNEL="6.1.79-99.164.amzn2023.x86_64"
ARCH="x86_64"
Ubuntu nodes don't have an `/etc/eks/release` file.
I don't see the `ImageMaximumGCAge` feature gate being passed to the kubelet when reviewing its parameters via `ps aux`. As I said, I'm not 100% sure that's the cause; the issue exists either way. It was just my only remote idea of why/how it could be happening, and the 3-day-12-hour window lines up suspiciously well.
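A hedged way to double-check what the kubelet is actually running with, beyond `ps aux`, is to read its live config through the API server's node proxy (`NODE_NAME` and the `jq` filter below are just illustrative):

```sh
# Dump the kubelet's effective configuration via the API server's node proxy
NODE_NAME=ip-10-102-0-15.ca-central-1.compute.internal   # placeholder node name
kubectl get --raw "/api/v1/nodes/${NODE_NAME}/proxy/configz" \
  | jq '.kubeletconfig | {featureGates, imageMaximumGCAge}'
```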
Any ideas!?
Just checked; the nodes experiencing the issue indeed no longer have the `pause` image, whereas the ones that are fine still have it. Other images you'd expect are still there, such as `amazon-k8s-cni`, `amazon-k8s-cni-init`, `aws-network-policy-agent`, `aws-ebs-csi-driver`, `csi-node-driver-registrar`, `kube-proxy`, and `liveness-probe`, but `pause` is gone.
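For anyone hitting the same thing, a hedged way to check this on a node directly:

```sh
# List the images still cached on the node and look for the sandbox image
sudo crictl images | grep pause
# Or ask containerd directly in the k8s.io namespace
sudo ctr -n k8s.io images ls | grep pause
```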
Welp, whether it's the aforementioned GC or another, it's GC:
Mar 29 00:19:21 ip-10-102-0-15 kubelet-eks.daemon[8709]: I0329 00:19:21.684279 8709 image_gc_manager.go:349] "Disk usage on image filesystem is over the high threshold, trying to free bytes down to the low threshold" usage=85 highThreshold=85 amountToFree=1326369996 lowThreshold=80
...
Mar 29 00:19:21 ip-10-102-0-15 containerd[3532]: time="2024-03-29T00:19:21.690427202Z" level=info msg="ImageDelete event name:\"602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5\""
...
I've never noticed any disk space issues previously. Regardless of that, the `pause` image should never be GC'd if removing it breaks the node, especially when the image is only 299 kB!@?$
Also, I haven't gotten strict yet about which pods get deployed on which node in this cluster. One of the nodes currently having the issue has 32 GB of disk; others that are fine right now have 20 GB. I don't think the answer is simply to add more space: of course, GC should never have touched that image given that it can't be re-pulled, and nothing else has ever hinted that I need more disk. This node even has nearly 10 GB free right now, so otherwise GC is doing its job fine.
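Those 85/80 numbers in the log line are the kubelet's standard disk-pressure image GC thresholds. A sketch of the relevant upstream KubeletConfiguration fields (defaults shown), plus a quick disk check that assumes containerd's default root path:

```sh
# KubeletConfiguration fields behind the "high threshold / low threshold" log line (defaults):
#   "imageGCHighThresholdPercent": 85   # start image GC once image-FS usage exceeds this
#   "imageGCLowThresholdPercent": 80    # keep deleting until usage drops below this
#   "imageMinimumGCAge": "2m0s"         # never GC images younger than this
# Check image filesystem usage on the node (assumes containerd's default root)
df -h /var/lib/containerd
```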
This was an issue with 1.29 that should be addressed on AL2: #1597
We haven't had any reports of this on AL2023. Can you verify that containerd is reporting the sandbox image as `pinned`?
> sudo crictl inspecti $SANDBOX_IMAGE | grep pinned
"pinned": true
crictl inspecti $SANDBOX_IMAGE | grep pinned
`$SANDBOX_IMAGE` isn't set; not sure if it's meant to be. On my AL2023 nodes:
{
"status": {
"pinned": true
},
// ...
}
Seems like it is pinned. I haven't encountered the problem on the AL2023 nodes since the last image update I did a few days ago; perhaps that fix got pushed both ways? I'm pretty sure I also saw it happen once on AL2 before I switched, which makes sense given the issue you linked. Wish I had seen that issue much earlier; it didn't come up in my searching!
Definitely not pinned on the latest Ubuntu AMI though. Any idea where I can report that?
Weirdly enough though, the Ubuntu image has this in `/etc/containerd/config.toml`:
[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5"
I tried restarting containerd, since that is in the config, but it's still not showing as pinned. `crictl info` shows `sandboxImage` set correctly and identically on both.
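A possible manual workaround, hedged: containerd's CRI plugin (1.7+) treats the `io.cri-containerd.pinned=pinned` image label as its pin marker, so labeling the image by hand should keep kubelet GC away from it. The label name is containerd-internal, so verify it against the containerd version on the Ubuntu AMI before relying on it:

```sh
# Label the sandbox image as pinned in containerd's k8s.io namespace, then verify
SANDBOX_IMAGE="602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5"
sudo ctr -n k8s.io images label "$SANDBOX_IMAGE" io.cri-containerd.pinned=pinned
sudo crictl inspecti "$SANDBOX_IMAGE" | grep pinned
```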
I'm going to close this given the existence of #1597; I wish I had found that issue in my initial searches! If I see it again on AL2023 I'll comment or open a new issue. Otherwise, if someone knows where to post an issue for the Ubuntu AMI, I'd appreciate being pointed in that direction.
Normal Scheduled 18s default-scheduler Successfully assigned iwork-ui/iwork-ui-deployment-748c87cc58-5l57j to ip-172-31-47-25.ap-south-1.compute.internal
Warning FailedCreatePodSandBox 7s (x2 over 18s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.ap-south-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.ap-south-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.ap-south-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.ap-south-1.amazonaws.com/eks/pause:3.5": unexpected status from HEAD request to https://602401143452.dkr.ecr.ap-south-1.amazonaws.com/v2/eks/pause/manifests/3.5: 401 Unauthorized
I'm not sure what you're saying without any comment and only an error, but you should review #1597 for some workarounds if needed. If you're experiencing it on AL2023 then definitely report back; otherwise, #1597 is for AL2, and I was experiencing it on Ubuntu's EKS AMI (jammy), which this repo isn't the place to report.
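For anyone landing here with the same 401, a hedged sketch of the usual manual recovery on an affected node; it assumes the AWS CLI is installed and the node's instance role is allowed to pull from the EKS ECR repositories:

```sh
# Re-pull the sandbox image with fresh ECR credentials (region/account from the error above)
REGION=ap-south-1
ECR_HOST=602401143452.dkr.ecr.${REGION}.amazonaws.com
sudo crictl pull --creds "AWS:$(aws ecr get-login-password --region "$REGION")" "${ECR_HOST}/eks/pause:3.5"
```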