awslabs/amazon-eks-ami

Pods stuck in ContainerCreating due to pause image pull error 401 unauthorized

VikramPunnam opened this issue · 7 comments

We generally build a custom EKS AMI using the EKS optimized AMI as the base image in the ap-south-1 region and copy it to other regions for EKS cluster setup.

We are hitting the issue below in EKS after upgrading to 1.27, whenever the pause image gets deleted on a node.

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to pull image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to resolve reference "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": pulling from host 900889452093.dkr.ecr.ap-south-2.amazonaws.com failed with status code [manifests 3.5]: 401 Unauthorized

Can anyone help me, please?

How did the pause image get deleted from the node?

I've seen this failure mode a few times in the past, because containerd doesn't have a way to obtain ECR credentials to pull the sandbox container image. That's why we pull it with a systemd unit at launch time (if it's not already cached in the AMI): https://github.com/awslabs/amazon-eks-ami/blob/master/files/sandbox-image.service

You could systemctl restart sandbox-image to trigger a pull, and we could feasibly run this periodically so this isn't a terminal node failure; but I'd still look into why the image was deleted to begin with.
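
On an affected node, recovery is roughly the following (assuming the node came from the EKS optimized AMI, which ships that unit):

# re-pull the sandbox (pause) image using the unit that knows how to get ECR credentials
sudo systemctl restart sandbox-image

# confirm the pause image is back in containerd's image store
sudo crictl images | grep pause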

Hi @cartermckinnon,

Thanks for your reply.

We use a custom script that runs on every node to clean up unused images and exited containers. It was removing the pause image as well, which is what caused the trouble in our environment.

We have now modified the script so it can exclude certain images on the node.
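
For anyone else doing this, the general idea is something like the following (a sketch only, not our exact script; it assumes the default containerd config location on the EKS AMI and that grepping the sandbox_image line is good enough):

# read the configured sandbox image from containerd's config so the cleanup never removes it
SANDBOX_IMAGE=$(awk -F'"' '/sandbox_image/ {print $2}' /etc/containerd/config.toml)

# remove every other image reference (a real script should also skip images still used by running pods);
# ignore failures so one bad reference doesn't abort the loop
for ref in $(sudo crictl images -o json | python3 -c 'import json,sys; [print(t) for i in json.load(sys.stdin)["images"] for t in i.get("repoTags", [])]'); do
    [ "$ref" = "$SANDBOX_IMAGE" ] && continue
    sudo crictl rmi "$ref" || true
done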

I've been running into this issue on nodes randomly since the 1.29 upgrade, on both AWS EKS managed nodes running AL2 and AL2023 as well as Ubuntu's EKS image...

It's getting really frustrating to have to keep refreshing nodes, as that's the only fix I can figure out.

I can't find much info or many threads about it; this was one of the few. Nothing is modified on the nodes themselves; we run the provided AMIs and never access the nodes directly.

So, I was just about to try logging into the nodes that are currently affected and was checking the docs, because I don't know offhand where or how kubelet/Kubernetes caches images. As of 1.29:

Garbage collection for unused container images
FEATURE STATE: Kubernetes v1.29 [alpha]

As an alpha feature, you can specify the maximum time a local image can be unused for, regardless of disk usage. This is a kubelet setting that you configure for each node.

To configure the setting, enable the ImageMaximumGCAge [feature gate](https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/) for the kubelet, and also set a value for the ImageMaximumGCAge field in the kubelet configuration file.

The value is specified as a Kubernetes duration; for example, you can set the configuration field to 3d12h, which means 3 days and 12 hours.
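
For reference, enabling it on a node would look roughly like this (just a sketch; /etc/kubernetes/kubelet/kubelet-config.json is the path the AL2 EKS optimized AMI uses, other images may differ):

# add the feature gate and the age field to the kubelet config, then restart the kubelet
sudo python3 - /etc/kubernetes/kubelet/kubelet-config.json <<'EOF'
import json, sys
path = sys.argv[1]
with open(path) as f:
    cfg = json.load(f)
cfg.setdefault("featureGates", {})["ImageMaximumGCAge"] = True
cfg["imageMaximumGCAge"] = "3d12h"   # Kubernetes duration: 3 days and 12 hours
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
EOF
sudo systemctl restart kubelet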

That sounds incredibly fishy and is potentially the issue.

Note that this is happening on EKS-managed nodes running Amazon Linux AMIs as well.

I've seen this failure because I tried to prune the images myself using crictl rmi --prune. Running systemctl restart sandbox-image as @cartermckinnon suggested fixed the problem, but I was wondering: do we really need to prune the images ourselves?

I saw this article, https://repost.aws/knowledge-center/eks-worker-nodes-image-cache, which suggests the kubelet already cleans up images according to the image-gc-high-threshold attribute (default 85%).

Default values on one of my nodes:

curl -sSL "http://localhost:8001/api/v1/nodes/<MY_NODE_NAME>/proxy/configz" | python3 -m json.tool | grep image
        "imageMinimumGCAge": "2m0s",
        "imageMaximumGCAge": "0s",
        "imageGCHighThresholdPercent": 85,
        "imageGCLowThresholdPercent": 80,

do we really need to prune the images ourselves?

Nope! The kubelet will take care of this. Deleting images out of band almost always hurts more than it helps.