aws/karpenter-provider-aws

Error "failed to call webhook: the server rejected our request for an unknown reason" after upgrade from 0.37.4 to 1.0.4.

Closed this issue · 11 comments

Description

Observed Behavior:

After upgrading from 0.37.4 to 1.0.4, I see many errors like these in the Karpenter logs:

"... Internal error occurred: failed calling webhook "validation.webhook.karpenter.sh": failed to call webhook: the server rejected our request for an unknown reason ..." and "Internal error occurred: failed calling webhook "defaulting.webhook.karpenter.k8s.aws": failed to call webhook: the server rejected our request for an unknown reason"

and I can confirm that:

  • validation.webhook.karpenter.sh
  • validation.webhook.config.karpenter.sh
  • defaulting.webhook.karpenter.k8s.aws
  • validation.webhook.karpenter.k8s.aws
are removed during the 1.0.4 deployment.

Reproduction Steps (Please include YAML):

Values.yaml for ArgoCD:

fullnameOverride: "karpenter"

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::XXXXXXXX:role/iamservice-role

settings:
  clusterName: cluster1
  interruptionQueue: cluster1-karpenter-interruptions
  featureGates:
    spotToSpotConsolidation: true

podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/path: /metrics
  prometheus.io/port: "8000"

webhook:
  enabled: true

I also use the following kustomization patches (because Karpenter is deployed in the karpenterns namespace):

patches:
- path: patches/karpenter/crds/nodepools.json
  target:
    kind: CustomResourceDefinition
    name: nodepools.karpenter.sh
- path: patches/karpenter/crds/ec2nodeclasses.json
  target:
    kind: CustomResourceDefinition
    name: ec2nodeclasses.karpenter.k8s.aws
- path: patches/karpenter/crds/nodeclaims.json
  target:
    kind: CustomResourceDefinition
    name: nodeclaims.karpenter.sh

cat patches/karpenter/crds/ec2nodeclasses.json
[
  {
    "op": "add",
    "path": "/spec/conversion/webhook/clientConfig/service/namespace",
    "value": "karpenterns"
  }
]

cat patches/karpenter/crds/nodeclaims.json
[
  {
    "op": "add",
    "path": "/spec/conversion/webhook/clientConfig/service/namespace",
    "value": "karpenterns"
  }
]

cat patches/karpenter/crds/nodepools.json
[
  {
    "op": "add",
    "path": "/spec/conversion/webhook/clientConfig/service/namespace",
    "value": "karpenterns"
  }
]

Versions:

  • Chart Version: 1.0.4
  • Kubernetes Version (kubectl version): Server Version: v1.28.12-eks-a18cd3a

Hello!

I experienced the exact same issue when upgrading from 0.37.3 to 1.0.5. The only difference in my case is that I had to wait for 1.0.5, which enables the migration to v1 with the webhooks, since ArgoCD currently prevents the deletion of the webhooks. (It is not clear to me which webhooks we are talking about; I guess the ones already listed by @AndrzejWisniewski.) Though even by removing them manually and restarting the deployment after the upgrade we saw the same errors.

After this, we can no longer provision any new nodes. However, the rollback to 0.37.3 works well; we have used it while trying different 1.0.x versions on our dev cluster.

We are working on EKS 1.30 and our current deployment of Karpenter is done via an ArgoCD App:

Chart.yaml

apiVersion: v2
name: karpenter
description: A Helm chart for installing Karpenter
type: application

version: 0.3.0
appVersion: "0.37.3"

dependencies:
- name: karpenter
  version: 0.37.3
  repository: oci://private-registry/karpenter

values.yaml

karpenter:
  webhook:
    enabled: true
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "8000"
...

I would be glad to give more details if required.

Thanks!

Seeing exact same error from an upgrade last night from v0.37.3 to v1.0.5 on K8s v1.28 (AWS EKS).

Our deployment path is using "helm template" to generate lightly templated (just name prefixes, IRSA annotations, etc) resource files that our own simple manual deployment tool (Kontemplate) fills in and replaces. Am digging in this morning and found this thread.

Our values.yaml is just:

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::{{.aws_account}}:role/{{.aws_role_path}}/{{.workspace}}-irsa-karpenter-v1beta1

settings:
  clusterName: x{x{.workspace}}-telemetry
  interruptionQueue: x{x{.workspace}}-telemetry-karpenter

# logLevel: debug
helm/helm-values.yaml (END)

The errors start with a TLS handshake error and then a collection of reconciler errors:

{"level":"ERROR","time":"2024-10-02T15:08:28.477Z","logger":"webhook","message":"http: TLS handshake error from 10.95.167.138:34812: EOF\n","commit":"652e6aa"}
{"level":"ERROR","time":"2024-10-02T15:08:28.482Z","logger":"controller","message":"Reconciler error","commit":"652e6aa","controller":"nodepool.readiness","controllerGroup":"karpenter.sh","controllerKind":"NodePool","NodePool":{"name":"generic-worker-autoscale"},"namespace":"","name":"generic-worker-autoscale","reconcileID":"bf399b7e-0f49-48c2-b3e7-e8be4709580d","error":"Internal error occurred: failed calling webhook \"validation.webhook.karpenter.sh\": failed to call webhook: the server rejected our request for an unknown reason"}
{"level":"ERROR","time":"2024-10-02T15:08:28.484Z","logger":"controller","message":"Reconciler error","commit":"652e6aa","controller":"migration.resource.nodepool","controllerGroup":"karpenter.sh","controllerKind":"NodePool","NodePool":{"name":"generic-worker-autoscale"},"namespace":"","name":"generic-worker-autoscale","reconcileID":"c5d363c4-ed00-4d7b-b638-593215a89bdf","error":"adding karpenter.sh/stored-version-migrated annotation to karpenter.sh/v1, Kind=NodePool, Internal error occurred: failed calling webhook \"validation.webhook.karpenter.sh\": failed to call webhook: the server rejected our request for an unknown reason"}
{"level":"ERROR","time":"2024-10-02T15:08:28.844Z","logger":"controller","message":"Reconciler error","commit":"652e6aa","controller":"migration.resource.ec2nodeclass","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"generic-worker-autoscale"},"namespace":"","name":"generic-worker-autoscale","reconcileID":"2212f030-3b93-4234-9f7a-77e9b6f5a24f","error":"adding karpenter.sh/stored-version-migrated annotation to karpenter.k8s.aws/v1, Kind=EC2NodeClass, Internal error occurred: failed calling webhook \"defaulting.webhook.karpenter.k8s.aws\": failed to call webhook: the server rejected our request for an unknown reason"}
{"level":"ERROR","time":"2024-10-02T15:08:28.977Z","logger":"controller","message":"Reconciler error","commit":"652e6aa","controller":"nodepool.counter","controllerGroup":"karpenter.sh","controllerKind":"NodePool","NodePool":{"name":"generic-worker-autoscale"},"namespace":"","name":"generic-worker-autoscale","reconcileID":"f9ad0b14-cf7d-4bfb-a922-515fe3b3432e","error":"Internal error occurred: failed calling webhook \"validation.webhook.karpenter.sh\": failed to call webhook: the server rejected our request for an unknown reason"}
{"level":"ERROR","time":"2024-10-02T15:08:33.277Z","logger":"controller","message":"Reconciler error","commit":"652e6aa","controller":"nodeclass.hash","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"generic-worker-autoscale"},"namespace":"","name":"generic-worker-autoscale","reconcileID":"2bf04eae-46d4-4e3d-bc58-f245d40a4d05","error":"Internal error occurred: failed calling webhook \"defaulting.webhook.karpenter.k8s.aws\": failed to call webhook: the server rejected our request for an unknown reason"}
{"level":"ERROR","time":"2024-10-02T15:08:43.801Z","logger":"controller","message":"Reconciler error","commit":"652e6aa","controller":"nodeclass.status","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"generic-worker-autoscale"},"namespace":"","name":"generic-worker-autoscale","reconcileID":"19cc44d4-18a4-4898-8a33-5cb3405543d8","error":"Internal error occurred: failed calling webhook \"defaulting.webhook.karpenter.k8s.aws\": failed to call webhook: the server rejected our request for an unknown reason"}
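
For anyone triaging similar output, the reconciler errors above can be grouped by webhook and controller. This is a minimal sketch, not a Karpenter tool: it assumes the controller logs are available one JSON object per line, as shown above, and the function name `count_webhook_failures` is hypothetical.

```python
import json
import re
from collections import Counter

# Captures the webhook name from a 'failed calling webhook "<name>"' message.
WEBHOOK_RE = re.compile(r'failed calling webhook "([^"]+)"')


def count_webhook_failures(lines):
    """Count 'failed calling webhook' errors per (webhook, controller).

    lines is any iterable of log lines; lines that are not JSON objects
    or that carry no matching "error" field are ignored.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log entry
        if not isinstance(entry, dict):
            continue
        match = WEBHOOK_RE.search(str(entry.get("error", "")))
        if match:
            key = (match.group(1), entry.get("controller", "unknown"))
            counts[key] += 1
    return counts
```

Passing an open log file (for example a saved copy of the controller logs) to `count_webhook_failures` returns a `Counter` whose keys show which webhook each controller failed to reach.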

Here's what my NodePool looks like on that cluster. I need to dig through the docs to see if the annotations will give me a clue. The NodePool looks like it is already karpenter.sh/v1. My other clusters that are still on v0.37.3 (done just before .4 dropped) all show karpenter.sh/v1 as well, and they look identical to the one below from my Karpenter v1.0.5 on EKS v1.28 environment.

$ k  describe  nodepools.karpenter.sh
Name:         generic-worker-autoscale
Namespace:
Labels:       <none>
Annotations:  compatibility.karpenter.sh/v1beta1-kubelet-conversion: {"clusterDNS":["169.254.20.10"]}
              compatibility.karpenter.sh/v1beta1-nodeclass-reference: {"name":"generic-worker-autoscale"}
              karpenter.sh/nodepool-hash: 17332112437454130997
              karpenter.sh/nodepool-hash-version: v2
API Version:  karpenter.sh/v1
Kind:         NodePool
Metadata:
  Creation Timestamp:  2024-09-13T23:40:33Z
  Generation:          1
  Resource Version:    4623212
  UID:                 22c7afe6-b1e2-4c45-bfed-cf7099378f39
Spec:
  Disruption:
    Budgets:
      Nodes:               10%
    Consolidate After:     5m0s
    Consolidation Policy:  WhenEmpty
  Limits:
    Cpu:  100
  Template:
    Metadata:
      Labels:
        Dedicated:  generic-worker-autoscale
    Spec:
      Expire After:  Never
      Node Class Ref:
        Group:  karpenter.k8s.aws
        Kind:   EC2NodeClass
        Name:   generic-worker-autoscale
      Requirements:
        Key:       karpenter.sh/capacity-type
        Operator:  In
        Values:
          spot
        Key:       kubernetes.io/arch
        Operator:  In
        Values:
          arm64
          amd64
        Key:       karpenter.k8s.aws/instance-cpu
        Operator:  Lt
        Values:
          129
        Key:       karpenter.k8s.aws/instance-cpu
        Operator:  Gt
        Values:
          3
        Key:       karpenter.k8s.aws/instance-category
        Operator:  NotIn
        Values:
          t
          a
          g
        Key:       karpenter.k8s.aws/instance-family
        Operator:  NotIn
        Values:
          m1
          m2
          m3
          m4
          c1
          c2
          c3
          c4
        Key:       kubernetes.io/os
        Operator:  In
        Values:
          linux
      Startup Taints:
        Effect:  NoSchedule
        Key:     di.joby.aero/wait-on-node-critical
        Value:   true
      Taints:
        Effect:  NoSchedule
        Key:     dedicated
        Value:   generic-worker-autoscale
Status:
Events:  <none>
$

Has anyone found a fix so far other than rolling back to a previously working version?

I couldn't even roll back cleanly. The finalizer on the CRDs was hanging while talking to /termination because the webhooks were toast.

I applied my 0.37.3 back over the top and that worked. I cleanly uninstalled 0.37.3 and then installed 1.0.5 on the cluster and that worked, but that's not what I want to do for production. I kinda flailed around whacking on things in the deployment, so YMMV, but it appeared to be a certificate/CA issue on the 1.0.5 webhook. I saw some errors about X.509 not being able to determine the CA, but didn't capture those details.

Sorry for the delayed response here. I suspect this may be an issue we're already aware of and getting a fix out for, but I have a few clarifying questions:

I can confirm that:
- validation.webhook.karpenter.sh
- validation.webhook.config.karpenter.sh
- defaulting.webhook.karpenter.k8s.aws
- validation.webhook.karpenter.k8s.aws
are removed during 1.0.4 deployment.

@AndrzejWisniewski how did you go about confirming this? Was this done by checking that the resources were not included in the updated helm deployment, or were you able to confirm that they were removed from the cluster? There's some more context here, but because knative adds owner references to the objects, Argo won't prune the objects even if it was the original creator.
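
To answer this from the cluster rather than from the rendered chart, the webhook configurations can be listed and checked for owner references. The following is a rough sketch under stated assumptions: it expects the JSON printed by `kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations -o json`, it uses the four names reported in this thread, and the function name `find_leftover_webhooks` is hypothetical.

```python
import json

# Webhook configuration names reported earlier in this thread.
KARPENTER_WEBHOOKS = {
    "validation.webhook.karpenter.sh",
    "validation.webhook.config.karpenter.sh",
    "defaulting.webhook.karpenter.k8s.aws",
    "validation.webhook.karpenter.k8s.aws",
}


def find_leftover_webhooks(kubectl_json):
    """Return (kind, name, has_owner_refs) for each Karpenter webhook
    configuration that is still present in the cluster.

    kubectl_json is assumed to be a List object with an "items" array.
    """
    found = []
    for item in json.loads(kubectl_json).get("items", []):
        meta = item.get("metadata", {})
        name = meta.get("name")
        if name in KARPENTER_WEBHOOKS:
            # A non-empty ownerReferences list is what can stop Argo
            # from pruning the object, per the comment above.
            has_owner_refs = bool(meta.get("ownerReferences"))
            found.append((item.get("kind"), name, has_owner_refs))
    return found
```

An empty list means none of the four configurations remain; any entry with `has_owner_refs` set to `True` is a candidate for the leak described here.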

Though even by removing them manually and restarting the deployment after the upgrade we saw the same errors.

@laserpedro This is extremely surprising to me. Are you able to share what you did to remove the webhooks, and what errors you continued to see?

I suspect this may be an issue we're already aware of and getting a fix out for

I suspected as much considering some recent commits I've seen on your main branch, but I'd really like more transparency on what exactly is going on here. Is there maybe a link to a github issue of this thing you're aware of?

Truth be told we've spent the last two weeks trying to upgrade to karpenter 1.x and we've encountered problem after problem after problem.

Following up on my previous post, I've instructed my team to pause the upgrade until the fix for this issue, whatever it is that you may already be aware of, is released.

@jmdeal thank you for watching this issue.

This morning I tried to upgrade again to reproduce the behavior I witnessed:

  • upgrade the helm chart version from 0.37.3 to 1.0.5
  • sync the Karpenter application in ArgoCD
  • From the ArgoCD UI perspective, the validating and mutating webhooks are deleted and the operators restarted.
  • RECTIFICATION: my apologies @jmdeal, the sequence of errors is a bit different; with my repeated attempts to upgrade I mixed up the errors. After the controller is restarted, it displays the error:
    failed calling webhook "defaulting.webhook.karpenter.k8s.aws": failed to call webhook: the server rejected our request for an unknown reason
    Then I manually deleted the validating and mutating webhooks using kubectl delete and restarted the operator.
    I now see this in the controller logs:

{"level":"ERROR","time":"2024-10-04T05:22:08.702Z","logger":"webhook","message":"http: TLS handshake error from 10.247.80.103:50080: read tcp 100.67.7.198:8443->10.247.80.103:50080: read: connection reset by peer\n","commit":"652e6aa"}

(our node sg is properly configured to accept TCP from the control plane on 8443)

I guess this time it is the conversion webhook that cannot establish the connection with the control plane (10.247.x.x matches my control plane CIDR range, 100.67.x.x matches my data plane CIDR range). Please correct me if my understanding is wrong.

(I am also seeing a sync error from ArgoCD on the CPU limits, where "1000" is cast as 1k, but that is more related to ArgoCD I guess, and let's solve one problem at a time.)

We just released new patch versions of the pre-v1.0 minor versions that fix this issue so that these configuration resources aren't leaked. Please upgrade to one of the following versions before going to 1.0.x, since these versions remove the ownerReference that causes Argo to leak the resources and causes the failure on upgrade:

  • v0.33.10
  • v0.34.11
  • v0.35.10
  • v0.36.7
  • v0.37.5
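
As a small illustration of how the list above might be applied in an upgrade script, the sketch below maps a running version to the patched release in the same minor line. It is only an assumption-laden helper, not an official tool: the function names are hypothetical, and the table contains nothing beyond the five versions listed here.

```python
# Patched pre-v1.0 releases from the list above, keyed by (major, minor).
PATCHED_RELEASES = {
    (0, 33): (0, 33, 10),
    (0, 34): (0, 34, 11),
    (0, 35): (0, 35, 10),
    (0, 36): (0, 36, 7),
    (0, 37): (0, 37, 5),
}


def parse_version(version):
    """Turn 'v0.37.4' or '0.37.4' into the tuple (0, 37, 4)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def patched_release_for(current):
    """Return the listed patched release for the same minor line,
    or None if that minor line is not in the list."""
    patched = PATCHED_RELEASES.get(parse_version(current)[:2])
    if patched is None:
        return None
    return "v" + ".".join(str(part) for part in patched)


def has_owner_reference_fix(current):
    """True if current is at or past the listed patched release of its
    minor line. False also covers minor lines that are not listed, so
    False means 'not confirmed by this list', nothing more."""
    cur = parse_version(current)
    patched = PATCHED_RELEASES.get(cur[:2])
    return patched is not None and cur >= patched
```

For the versions mentioned in this thread, `patched_release_for("0.37.4")` gives `"v0.37.5"`, which would be the intermediate step before moving to 1.0.x.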

Also, this seems related to points discussed in #6982 and #6847. See this comment for a description of why this occurs.

Then I manually deleted the validating and mutating webhooks using kubectl delete and restarted the operator.
I now see this in the controller logs:

{"level":"ERROR","time":"2024-10-04T05:22:08.702Z","logger":"webhook","message":"http: TLS handshake error from 10.247.80.103:50080: read tcp 100.67.7.198:8443->10.247.80.103:50080: read: connection reset by peer\n","commit":"652e6aa"}

(our node sg is properly configured to accept TCP from the control plane on 8443)

This appears to be this issue: #6898. This is a spurious error that happens generally in k8s when using webhooks; you should be able to safely ignore it. There's more context in an older issue: kubernetes-sigs/karpenter#718. If you think there's an issue with Karpenter's operations that is related to this error, please leave an update in #6898.

I'm going to close this issue out now that the releases with the fixes have been made.