buildpacks-community/kpack

ClusterBuilder never becomes ready on AWS ECR private registry

georgethebeatle opened this issue · 2 comments

In Korifi we are trying to bump kpack to 0.13.1.

After bumping kpack to 0.13.1 in Korifi we noticed that the ClusterBuilder that our helm chart creates never becomes ready when deploying against a private Amazon ECR registry. With kpack 0.12.3 we do not see this problem.

The ClusterBuilder status below indicates that the registry denies access to the kpack controller:

status:
  conditions:
  - lastTransitionTime: "2024-02-01T09:58:11Z"
    message: Builder has no latestImage
    reason: NoLatestImage
    status: "False"
    type: Ready
  - lastTransitionTime: "2024-02-01T09:58:11Z"
    message: 'HEAD https://007801690126.dkr.ecr.eu-west-1.amazonaws.com/v2/eks-e2e-kpack-builder/manifests/latest:
      unexpected status code 401 Unauthorized (HEAD responses have no body, use GET
      for details)'
    reason: ReconcileFailed
    status: "False"
    type: UpToDate
  observedGeneration: 1
  stack: {}

Here is the related kpack-controller log:

{"level":"error","ts":"2024-02-01T09:58:11.770660587Z","logger":"controller","caller":"controller/controller.go:566","msg":"Reconcile error","commit":"843bfcd","knative.dev/kind":"clusterbuilders.kpack.io","knative.dev/traceid":"f54d4504-c2ec-4657-a626-77dc7977af73","knative.dev/key":"cf-kpack-cluster-builder","duration":0.885850232,"error":"HEAD https://007801690126.dkr.ecr.eu-west-1.amazonaws.com/v2/eks-e2e-kpack-builder/manifests/latest: unexpected status code 401 Unauthorized (HEAD responses have no body, use GET for details)","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\tknative.dev/pkg@v0.0.0-20230821102121-81e4ee140363/controller/controller.go:566\n
knative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20230821102121-81e4ee140363/controller/controller.go:543\nknative.dev/pkg/controller.(*Im
pl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20230821102121-81e4ee140363/controller/controller.go:491"}

We are running the kpack controller with a serviceaccount that is mapped to an EKS role with the following policy:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Action": [
				"ecr:BatchCheckLayerAvailability",
				"ecr:BatchDeleteImage",
				"ecr:BatchGetImage",
				"ecr:CompleteLayerUpload",
				"ecr:CreateRepository",
				"ecr:GetAuthorizationToken",
				"ecr:GetDownloadUrlForLayer",
				"ecr:InitiateLayerUpload",
				"ecr:ListImages",
				"ecr:PutImage",
				"ecr:UploadLayerPart"
			],
			"Effect": "Allow",
			"Resource": "*"
		}
	]
}
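
For reference, the role is attached via the standard eks.amazonaws.com/role-arn annotation on the kpack controller's serviceaccount; a minimal sketch of that mapping (resource names are illustrative, the ARN is the one from our environment):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: controller        # kpack controller serviceaccount
  namespace: kpack
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::007801690126:role/eks-e2e-ecr_access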

We have had no issues with this role so far (we double-checked that it works with kpack 0.12.3). We tried giving the serviceaccount full ECR access by assigning an ecr:* policy, but it made no difference. This made us think that the credentials are somehow not being picked up by the code.

In AWS the credentials are injected into the pod environment by an AWS webhook. The webhook inspects the serviceaccount and, if it is annotated with eks.amazonaws.com/role-arn, injects the related AWS credentials as env vars into all pods running with that serviceaccount. In our case this is the kpack controller pod. We can see that this does happen:

spec:
  containers:
  - env:
    ...
    - name: AWS_STS_REGIONAL_ENDPOINTS
      value: regional
    - name: AWS_DEFAULT_REGION
      value: eu-west-1
    - name: AWS_REGION
      value: eu-west-1
    - name: AWS_ROLE_ARN
      value: arn:aws:iam::007801690126:role/eks-e2e-ecr_access
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
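
As a sanity check that these env vars are enough for the SDK's default credential chain, a minimal Go sketch along the following lines (not kpack code, just a plain aws-sdk-go-v2 ECR client) should be able to fetch an ECR authorization token when run under the same serviceaccount:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ecr"
)

func main() {
	// LoadDefaultConfig picks up AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE
	// (injected by the EKS webhook) and assumes the role via STS.
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatalf("loading AWS config: %v", err)
	}

	// If the web identity credentials resolve, this returns a docker
	// username/password pair valid for the account's ECR registries.
	out, err := ecr.NewFromConfig(cfg).GetAuthorizationToken(context.Background(), &ecr.GetAuthorizationTokenInput{})
	if err != nil {
		log.Fatalf("getting ECR auth token: %v", err)
	}
	for _, d := range out.AuthorizationData {
		fmt.Println("got token for", *d.ProxyEndpoint)
	}
}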

We suspected that the go-containerregistry dependency of kpack, which was bumped as part of 0.13.1, might somehow be failing to propagate this information to ECR, so we tried downgrading it from 0.17.0 to 0.16.1 and rebuilding the kpack images. Unfortunately, after replacing the controller image with the patched one, we observed the same behaviour.
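
For anyone wanting to reproduce that experiment, the downgrade amounts to something like the following in the kpack repo before rebuilding the images (exact steps may differ):

go mod edit -require=github.com/google/go-containerregistry@v0.16.1
go mod tidy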

Surprisingly, the true culprit is 1fcfca6 (hurrah git bisect). Still looking into why this broke it, but at least it gives us a starting place.

This whole thing is caused by how AWS versions their SDKs: see aws/aws-sdk-go-v2#2370 (comment).

Because they use a different version per service, when one of their core libraries makes a backwards-incompatible change, the other libraries need to be bumped to interop with it. Also, for whatever reason, their repo is named v2, but the release git tags are v1. Oh, and when they make backwards-incompatible changes, they don't bump the major version as semver requires.
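
To make that concrete, a go.mod pulling in the AWS SDK ends up with a separate module (and version) per service, so a change in the core module ripples through all of them; the excerpt below is illustrative and the exact versions in kpack's go.mod may differ:

require (
	github.com/aws/aws-sdk-go-v2 v1.24.1                  // the "core" module: repo is /v2, tags are v1.x
	github.com/aws/aws-sdk-go-v2/config v1.26.3
	github.com/aws/aws-sdk-go-v2/service/ecr v1.24.7
	github.com/aws/aws-sdk-go-v2/service/ecrpublic v1.21.7
	github.com/aws/aws-sdk-go-v2/service/sts v1.26.7
)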