CA Certificate used to pull images from private repo not used by kubelet after v20240202
dtledev opened this issue · 4 comments
What happened:
- Images from a privately hosted repo with an HTTPS domain can no longer be pulled after upgrading nodes to the latest AMI versions.
- For EKS nodes to trust URLs from a private corporate domain, a CA certificate is loaded onto the node via userdata. At startup the node downloads the certificate `our-cert.crt` to the `/etc/pki/ca-trust/source/anchors` directory and executes the `update-ca-trust` command. This stopped working with release v20240202 and onwards. The setup follows a pattern similar to this blog post: https://aws.amazon.com/blogs/containers/use-private-certificates-to-enable-a-container-repository-in-amazon-eks/
- One of the destinations is a privately hosted Quay container repo with an HTTPS URL, referred to here as `https://quay.ourprivatedomain.com`.
Sample error:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m25s default-scheduler Successfully assigned aws-load-balancer-controller/aws-load-balancer-controller-6d947b5655-rnvzw to ip-10-128-26-140.ca-central-1.compute.internal
Warning FailedCreatePodSandBox 4m24s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "e3584ead4e35206f99dafb85b2e5831b28c8996698ba2deb727df66c63a4c261": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:50051: connect: connection refused"
Normal Pulling 2m36s (x4 over 4m13s) kubelet Pulling image "quay.ourprivatedomain.com/eks/aws-load-balancer-controller:v2.6.0"
Warning Failed 2m36s (x4 over 4m13s) kubelet Failed to pull image "quay.ourprivatedomain.com/eks/aws-load-balancer-controller:v2.6.0": rpc error: code = Unknown desc = failed to pull and unpack image "quay.ourprivatedomain.com/eks/aws-load-balancer-controller:v2.6.0": failed to resolve reference "quay.prod-openshift-na.hybrid.sunlifecorp.com/eks/aws-load-balancer-controller:v2.6.0": failed to do request: Head "https://quay.ourprivatedomain.com/v2/eks/aws-load-balancer-controller/manifests/v2.6.0": tls: failed to verify certificate: x509: certificate signed by unknown authority
Warning Failed 2m36s (x4 over 4m13s) kubelet Error: ErrImagePull
Normal BackOff 2m21s (x6 over 4m12s) kubelet Back-off pulling image "quay.prod-openshift-na.hybrid.sunlifecorp.com/eks/aws-load-balancer-controller:v2.6.0"
Warning Failed 2m21s (x6 over 4m12s) kubelet Error: ImagePullBackOff
- We have not encountered this issue until upgrading to this version and later
- It works when reverting back to v20240117
What you expected to happen:
Expect the kubelet to respect the trusted CAs that are on the node.
How to reproduce it (as minimally and precisely as possible):
- Upgrade nodes to use v20240202 or later
- Load the CA cert for the domain onto the node
- Deploy a pod so that Kubernetes pulls the image from the privately hosted repo server
Anything else we need to know?:
Troubleshot by logging into the node and executing commands via Systems Manager:
- No issues with network connectivity from the node to "quay.ourprivatedomain.com" on port 443
- Confirmed that the certificate was loaded on the node
- Confirmed that the certificate is valid and not expired, and that the destination URL uses the certificate
- Able to execute `ctr image pull xxxxxx` and successfully pull an image to the node from the CLI (via SSM) without passing the `--skip-verify` flag
- Appears to affect only managed node groups; it does not seem to be an issue with Karpenter-created nodes that use the same userdata to load the certificate
Environment:
- AWS Region: ca-central-1
- Instance Type(s): t3a.xlarge
- EKS Platform version (use `aws eks describe-cluster --name <name> --query cluster.platformVersion`): "eks.6"
- Kubernetes version (use `aws eks describe-cluster --name <name> --query cluster.version`): "1.29"
- AMI Version: v20240202 and later
- Kernel (e.g. `uname -a`): Linux ip-10-128-95-56.ca-central-1.compute.internal 5.10.205-195.807.amzn2.x86_64 #1 SMP Tue Jan 16 18:28:59 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
- Release information (run `cat /etc/eks/release` on a node):
BASE_AMI_ID="ami-059705a71ed021143"
BUILD_TIME="Fri Feb 2 16:56:07 UTC 2024"
BUILD_KERNEL="5.10.205-195.807.amzn2.x86_64"
ARCH="x86_64"
What does the user data log look like? /var/log/cloud-init-output.log
Looks like it ran successfully; I can confirm that it is able to download the .crt file and run `update-ca-trust`.
I can see the certificates loaded in the file /etc/ssl/certs/ca-bundle.crt, which is technically a symlink to /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
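The check described above can also be scripted. The following is a minimal sketch, assuming the certificate file holds a single PEM certificate and that the bundle stores it with the same encoding; `pem_body` and `bundle_contains_cert` are hypothetical helper names, not part of any tool. It compares the base64 body of the certificate against the bundle with line breaks removed, so differences in line wrapping do not matter.

```shell
# Print the base64 body of the first certificate in a PEM file as one line.
pem_body() {
  awk '/-----BEGIN CERTIFICATE-----/ {found = 1; next}
       /-----END CERTIFICATE-----/ {exit}
       found' "$1" | tr -d ' \r\n'
}

# Succeed when the certificate in $1 appears in the PEM bundle $2.
bundle_contains_cert() {
  cert_body=$(pem_body "$1")
  # An empty body means $1 did not contain a PEM certificate.
  [ -n "$cert_body" ] || return 2
  tr -d ' \r\n' < "$2" | grep -qF -- "$cert_body"
}
```

On a node this might be called as `bundle_contains_cert /etc/pki/ca-trust/source/anchors/our-cert.crt /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem && echo present`. A match only shows that the on-disk bundle is current; it says nothing about what an already running process has loaded.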
Do you restart containerd after you `update-ca-trust`?
Restarting containerd after `update-ca-trust` seems to resolve it, thank you!
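For anyone applying the same fix, the ordering could be sketched in user data roughly as follows. This is an illustration under assumptions, not the reporter's actual script: `CERT_URL`, the use of `curl`, and the `run` and `install_ca_and_restart` helpers are all hypothetical, since the thread does not say how the certificate is downloaded. The sketch prints the commands by default and only executes them when `DRY_RUN=0` is set.

```shell
#!/bin/bash
# Sketch of user data that installs a private CA and then restarts containerd.
# CERT_URL is a hypothetical placeholder; the original report does not say
# where the certificate comes from or which tool fetches it.
CERT_URL="${CERT_URL:-https://example.com/our-cert.crt}"
ANCHOR_DIR="/etc/pki/ca-trust/source/anchors"

# Print commands instead of running them unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Download the CA, rebuild the trust store, then restart containerd so that
# it starts again after the trust store was rebuilt.
install_ca_and_restart() {
  run curl -fsSL -o "${ANCHOR_DIR}/our-cert.crt" "${CERT_URL}" &&
    run update-ca-trust &&
    run systemctl restart containerd
}

install_ca_and_restart
```

The `&&` chain stops at the first failing step, so containerd is not restarted if the download or the trust update fails.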
This makes sense; perhaps something changed in the order in which containerd starts between those releases.
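One way to test that hypothesis on an affected node is to compare when containerd became active with when the CA bundle was last written. The sketch below assumes a systemd host with GNU `date` and `stat`, and assumes the bundle's modification time reflects the last `update-ca-trust` run; `started_before_bundle_update` and `check_containerd_order` are hypothetical helper names.

```shell
# Succeed when the service start time ($1) is earlier than the bundle
# modification time ($2); both are Unix epoch seconds.
started_before_bundle_update() {
  [ "$1" -lt "$2" ]
}

# Compare containerd's start time with the CA bundle's mtime on a node.
check_containerd_order() {
  bundle="/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem"
  # Output looks like "ActiveEnterTimestamp=<date>"; keep the part after "=".
  started=$(systemctl show containerd --property=ActiveEnterTimestamp)
  started=${started#*=}
  # An empty timestamp means the service has not been active.
  [ -n "$started" ] || return 2
  service_epoch=$(date -d "$started" +%s) || return 2
  bundle_epoch=$(stat -c %Y "$bundle") || return 2
  if started_before_bundle_update "$service_epoch" "$bundle_epoch"; then
    echo "containerd started before the last CA bundle update"
  else
    echo "containerd started at or after the last CA bundle update"
  fi
}
```

Running `check_containerd_order` on a v20240117 node and on a v20240202 node would show whether the ordering differs between the two releases.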