awslabs/amazon-eks-ami

CA Certificate used to pull images from private repo not used by kubelet after v20240202

dtledev opened this issue · 4 comments

What happened:

  • Images from a privately hosted repo on an HTTPS domain can no longer be pulled after upgrading nodes to the latest AMI versions

  • In order for EKS nodes to trust URLs from a private corporate domain, a CA certificate is loaded onto the node via userdata. At startup the node downloads the certificate our-cert.crt to the /etc/pki/ca-trust/source/anchors directory and executes the update-ca-trust command. This stops working on release v20240202 and onwards

  • Similar pattern to this blog post: https://aws.amazon.com/blogs/containers/use-private-certificates-to-enable-a-container-repository-in-amazon-eks/

  • One of the destinations is a private hosted Quay container repo with an HTTPS URL which will be referenced as https://quay.ourprivatedomain.com.

Sample error:

  Type     Reason                  Age                    From               Message
  ----     ------                  ----                   ----               -------
  Normal   Scheduled               4m25s                  default-scheduler  Successfully assigned aws-load-balancer-controller/aws-load-balancer-controller-6d947b5655-rnvzw to ip-10-128-26-140.ca-central-1.compute.internal
  Warning  FailedCreatePodSandBox  4m24s                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "e3584ead4e35206f99dafb85b2e5831b28c8996698ba2deb727df66c63a4c261": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:50051: connect: connection refused"
  Normal   Pulling                 2m36s (x4 over 4m13s)  kubelet            Pulling image "quay.ourprivatedomain.com/eks/aws-load-balancer-controller:v2.6.0"
  Warning  Failed                  2m36s (x4 over 4m13s)  kubelet            Failed to pull image "quay.ourprivatedomain.com/eks/aws-load-balancer-controller:v2.6.0": rpc error: code = Unknown desc = failed to pull and unpack image "quay.ourprivatedomain.com/eks/aws-load-balancer-controller:v2.6.0": failed to resolve reference "quay.prod-openshift-na.hybrid.sunlifecorp.com/eks/aws-load-balancer-controller:v2.6.0": failed to do request: Head "https://quay.ourprivatedomain.com/v2/eks/aws-load-balancer-controller/manifests/v2.6.0": tls: failed to verify certificate: x509: certificate signed by unknown authority
  Warning  Failed                  2m36s (x4 over 4m13s)  kubelet            Error: ErrImagePull
  Normal   BackOff                 2m21s (x6 over 4m12s)  kubelet            Back-off pulling image "quay.prod-openshift-na.hybrid.sunlifecorp.com/eks/aws-load-balancer-controller:v2.6.0"
  Warning  Failed                  2m21s (x6 over 4m12s)  kubelet            Error: ImagePullBackOff
  • We did not encounter this issue until upgrading to v20240202 or later
  • Image pulls work again after reverting to v20240117

What you expected to happen:
Expect the kubelet to respect the trusted CAs that are installed on the node.

How to reproduce it (as minimally and precisely as possible):

  • Upgrade nodes to use v20240202 or later
  • Load the CA certificate for the domain onto the node
  • Deploy a pod so that Kubernetes pulls the image from the privately hosted repo server

Anything else we need to know?:
Troubleshot by logging in to the node and executing commands via Systems Manager (SSM):

  • No issues with network connectivity from the node to "quay.ourprivatedomain.com" on port 443
  • Confirmed that the certificate was loaded on the node
  • Confirmed that the certificate is valid and not expired and the destination URL uses the certificate
  • Able to execute ctr image pull xxxxxx and successfully pull an image to the node from CLI (via SSM) without passing --skip-verify flag
  • Appears to affect only managed node groups; it does not seem to be an issue with Karpenter-created nodes that use the same userdata to load the certificate
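
One more check that fits the list above is whether the CA bundle was regenerated after containerd had already started, since a long-running daemon may keep using the CA set it loaded at startup while a fresh CLI process such as ctr reads the current bundle. This is a minimal sketch, not something confirmed for this AMI: the helper name ca_bundle_is_stale_for is hypothetical, and the commented usage assumes GNU stat, GNU date, and systemd on the node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare the CA bundle's modification time with the
# time a service started. Both arguments are epoch seconds so the comparison
# itself can be checked without touching a real node.
ca_bundle_is_stale_for() {
  # $1 = bundle mtime (epoch seconds), $2 = service start time (epoch seconds)
  if [ "$1" -gt "$2" ]; then
    echo "stale"   # bundle changed after the service started
  else
    echo "fresh"   # service started after the last bundle change
  fi
}

# Possible usage on a node (bundle path taken from this issue; parsing the
# systemd timestamp with `date -d` is an assumption about GNU date):
#   bundle=/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
#   started=$(systemctl show -p ActiveEnterTimestamp containerd | cut -d= -f2-)
#   ca_bundle_is_stale_for "$(stat -c %Y "$bundle")" "$(date -d "$started" +%s)"
```

A result of "stale" would be consistent with containerd not having seen the certificate added by userdata, but it is only a hint, not proof of the cause.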

Environment:

  • AWS Region: ca-central-1

  • Instance Type(s): t3a.xlarge

  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): "eks.6"

  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): "1.29"

  • AMI Version: v20240202 and later

  • Kernel (e.g. uname -a):
    Linux ip-10-128-95-56.ca-central-1.compute.internal 5.10.205-195.807.amzn2.x86_64 #1 SMP Tue Jan 16 18:28:59 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

  • Release information (run cat /etc/eks/release on a node):

BASE_AMI_ID="ami-059705a71ed021143"
BUILD_TIME="Fri Feb  2 16:56:07 UTC 2024"
BUILD_KERNEL="5.10.205-195.807.amzn2.x86_64"
ARCH="x86_64"

What does the user data log look like? /var/log/cloud-init-output.log

It looks like it ran successfully. I can confirm that it is able to download the .crt file and run update-ca-trust.

I can see the certificates loaded in the file /etc/ssl/certs/ca-bundle.crt, which is technically a symlink to /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem

Do you restart containerd after you run update-ca-trust?

Restarting containerd after update-ca-trust seems to resolve it, thank you!
This makes sense; perhaps something changed in the startup order of containerd between those releases.
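
For anyone landing here with the same symptom, the resolution can be sketched as a userdata fragment like the one below. It is a minimal sketch under stated assumptions, not the exact script used in this thread: CERT_URL is a placeholder (the thread does not say where our-cert.crt is hosted), and the function name install_private_ca is hypothetical. The anchors directory, the update-ca-trust step, and the containerd restart come from the discussion above.

```shell
#!/usr/bin/env bash
# Placeholder values. CERT_URL is an assumption; replace it with wherever
# the certificate is actually hosted.
CERT_URL="${CERT_URL:-https://example.com/our-cert.crt}"
ANCHOR_DIR="${ANCHOR_DIR:-/etc/pki/ca-trust/source/anchors}"

install_private_ca() {
  # 1. Download the CA certificate into the trust anchors directory.
  curl -fsSL -o "${ANCHOR_DIR}/our-cert.crt" "${CERT_URL}" || return 1
  # 2. Rebuild the consolidated CA bundle from the anchors.
  update-ca-trust || return 1
  # 3. Restart containerd so the running daemon reloads the refreshed bundle.
  systemctl restart containerd
}

# Call install_private_ca from userdata before any private image is pulled.
```

The restart is placed last on purpose: restarting containerd before update-ca-trust finishes would leave the daemon with the old bundle again.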