aws/karpenter-provider-aws

Karpenter conversion webhooks do not work on chart version 1.0.2

dcherniv opened this issue · 10 comments

Description

Observed Behavior:
karpenter Chart version: 1.0.2
karpenter-crd Chart version: 1.0.2
with the webhooks configured in values as follows:

karpenter-crd:
  webhook:
    enabled: false
karpenter:
  enabled: true
  webhook:
    enabled: false

The webhooks fail with the error below, with no indication as to why:

{"level":"ERROR","time":"2024-09-11T20:30:53.969Z","logger":"controller","message":"Reconciler error","commit":"b897114","controller":"nodeclaim.podevents","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"istio-gateway-internal-secret-init-84df655d8b-sdxn4","namespace":"istio-system"},"namespace":"istio-system","name":"istio-gateway-internal-secret-init-84df655d8b-sdxn4","reconcileID":"86d25014-bd4b-4b5d-852f-a5c4d50fe7c4","error":"Internal error occurred: failed calling webhook \"validation.webhook.karpenter.sh\": failed to call webhook: the server rejected our request for an unknown reason"}

Expected Behavior:
Webhook to work in 1.0.2

Reproduction Steps (Please include YAML):
#6847 (comment)

Versions:

  • Chart Version: 1.0.2
  • Kubernetes Version (kubectl version): 1.29+ eks

Can you share if there is anything in the logs that indicates why the webhook rejected the request? I'm also curious if this even made it to the webhook or if something is getting in the way of the network traffic to the service/pod.

@jonathan-innis nothing beyond this.
{"level":"ERROR","time":"2024-09-11T20:30:53.969Z","logger":"controller","message":"Reconciler error","commit":"b897114","controller":"nodeclaim.podevents","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"istio-gateway-internal-secret-init-84df655d8b-sdxn4","namespace":"istio-system"},"namespace":"istio-system","name":"istio-gateway-internal-secret-init-84df655d8b-sdxn4","reconcileID":"86d25014-bd4b-4b5d-852f-a5c4d50fe7c4","error":"Internal error occurred: failed calling webhook "validation.webhook.karpenter.sh": failed to call webhook: the server rejected our request for an unknown reason"}
This is from Karpenter logs.
I'm assuming this was trying to validate a nodeclaim which didn't have group field set.
I'm reading the docs for the migration to v1. Can you confirm that 0.37.3 doesn't mutate resources?
It's worth noting that, at the point I got this error, all the resources I have defined for NodePools etc. in my chart had already been upgraded to the v1 spec and deployed prior to deploying the Karpenter 1.0.2 chart.
My path was:

  1. Upgrade to 0.37.3
  2. Upgrade all manifests to v1 spec on 0.37.3
  3. Deploy Karpenter 1.0.2
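For reference, in helm terms that path was roughly the following; the release names, namespace, and values file here are assumptions, so adjust them to your setup:

helm upgrade --install karpenter-crd oci://public.ecr.aws/karpenter/karpenter-crd --version 0.37.3 -n kube-system
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version 0.37.3 -n kube-system -f values.yaml
# re-apply the NodePool / EC2NodeClass manifests converted to the v1 spec, then:
helm upgrade --install karpenter-crd oci://public.ecr.aws/karpenter/karpenter-crd --version 1.0.2 -n kube-system
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version 1.0.2 -n kube-system -f values.yaml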

Upgrade all manifests to v1 spec on 0.37.3

When you ran through this part of your upgrade, did the CRDs also have v1 as the storage version? Even with the resources on the v1 spec, if they aren't stored at the correct storage version, there can be issues during the upgrade.
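For anyone checking this on their own cluster, something along these lines shows the storage version and the versions the objects are actually stored at (repeat for nodeclaims.karpenter.sh and ec2nodeclasses.karpenter.k8s.aws):

kubectl get crd nodepools.karpenter.sh -o jsonpath='{.spec.versions[?(@.storage==true)].name}'
kubectl get crd nodepools.karpenter.sh -o jsonpath='{.status.storedVersions}'

If status.storedVersions still lists v1beta1 after the migration, the apiserver has objects persisted at the old version and the upgrade can run into problems like this.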

Also, from the look of it, there is something on the network path that is blocking this traffic from getting through. If there are no errors on the webhook side, that would indicate to me that something is preventing the call from the apiserver from reaching the pod's service endpoint.
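To rule that out, assuming the chart was installed as a release named karpenter in kube-system, you could check that the webhook configuration points at a service with healthy endpoints:

kubectl get validatingwebhookconfiguration validation.webhook.karpenter.sh -o jsonpath='{.webhooks[*].clientConfig}'
kubectl get svc,endpoints -n kube-system karpenter
kubectl get pods -n kube-system -l app.kubernetes.io/name=karpenter

If the clientConfig references a service, namespace, or port that doesn't match what's actually deployed, or the endpoints list is empty, the apiserver's webhook call will fail in roughly this way.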

@jonathan-innis Found a couple of folks who encountered the same issue: #6847 (comment)
and #6879 (comment)
Deleting the validating webhooks altogether resolves this. I think they are carried over from 0.37.3 and are not needed on 1.0.2? Can you confirm? It's weird that they are not cleaned up automatically on upgrade.
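In case it helps others, cleaning these up looked roughly like this; validation.webhook.karpenter.sh is the name from the error above, while the other names are what I believe the pre-1.0 chart created, so adjust to whatever the first command shows:

kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep karpenter
kubectl delete validatingwebhookconfiguration validation.webhook.karpenter.sh validation.webhook.config.karpenter.sh
kubectl delete mutatingwebhookconfiguration defaulting.webhook.karpenter.sh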

Agreed, this looks like an issue with the interaction that Karpenter has with Argo. We definitely need to look at this since we shouldn't be leaving behind MutatingWebhookConfigurations and ValidatingWebhookConfigurations after the upgrade.

So, I think that we tracked it down. There's an open issue here that actually specifically describes this interaction between Knative and ArgoCD. What it comes down to is that ArgoCD refuses to delete objects with ownerReferences -- even if it was the original creator of the object.

Because knative injects an owner reference onto the ValidatingWebhookConfiguration and MutatingWebhookConfiguration as part of its reconciliation flow, ArgoCD ignores these objects when pruning and leaves them behind on the cluster. I'm trying out what happens when we don't have these owner references by patching the knative package we are using in #7042, to see if that resolves the issue (I suspect it should).
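If you want to see the injected owner reference on a live cluster, something like this should show it (and the same check works for the MutatingWebhookConfiguration):

kubectl get validatingwebhookconfiguration validation.webhook.karpenter.sh -o jsonpath='{.metadata.ownerReferences}'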

Code link to where ArgoCD does this check: https://github.com/argoproj/gitops-engine/blob/master/pkg/cache/cluster.go#L1167

Could this be part of the issue?
https://github.com/aws/karpenter-provider-aws/blob/main/charts/karpenter/Chart.yaml
https://github.com/aws/karpenter-provider-aws/blob/v1.0.2/charts/karpenter/Chart.yaml
^-- v1.0.0 of the helm chart should use v1 CRDs, but from what I can tell it's referencing v1beta1 CRDs.
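One way to check what a given chart version actually ships is to render the karpenter-crd chart and grep the served API versions; the OCI location below is the public ECR one:

helm template karpenter-crd oci://public.ecr.aws/karpenter/karpenter-crd --version 1.0.2 | grep -E 'name: v1(beta1)?$'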

Also, in the future, after these upstream issues get resolved, it'd be nice if the EKS Blueprints karpenter addon could get some love.
aws-quickstart/cdk-eks-blueprints#1078
I tried doing a fresh install of the karpenter helm chart 1.0.2 (via EKS Blueprints for CDK), where there should be no need to convert CRDs or migrate from old versions of the app. Even in that case the fresh install doesn't work: from what I can tell it installs older beta versions of the CRDs, and I think that's the fault of the upstream helm chart, based on the references in the link above.

We just released the latest patch versions of the pre-v1.0 minor versions to fix this issue so that these configuration resources aren't leaked. Please upgrade to one of the following versions prior to going to 1.0.x, since these versions remove the ownerReference that causes Argo to leak the resources and fail the upgrade:

  • v0.33.10
  • v0.34.11
  • v0.35.10
  • v0.36.7
  • v0.37.5
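After going through one of these patch versions and then to 1.0.x, a quick sanity check that nothing was left behind (same resource types as earlier in this thread) is:

kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep karpenter

The stale pre-1.0 karpenter entries should no longer show up.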

Closing since this should be resolved now.