Error "failed to call webhook: the server rejected our request for an unknown reason" after upgrade from 0.37.4 to 1.0.4.
Closed this issue · 11 comments
Description
Observed Behavior:
After upgrading from 0.37.4 to 1.0.4 I see a lot of errors like the following in the Karpenter logs:
Internal error occurred: failed calling webhook "validation.webhook.karpenter.sh": failed to call webhook: the server rejected our request for an unknown reason
and
Internal error occurred: failed calling webhook "defaulting.webhook.karpenter.k8s.aws": failed to call webhook: the server rejected our request for an unknown reason
and I can confirm that:
- validation.webhook.karpenter.sh
- validation.webhook.config.karpenter.sh
- defaulting.webhook.karpenter.k8s.aws
- validation.webhook.karpenter.k8s.aws
are removed during the 1.0.4 deployment.
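One way to check whether these configurations are actually gone from the cluster (resource kinds are standard Kubernetes; the names come from the list above):
```shell
# Each command returns "NotFound" once the configuration has been removed
kubectl get validatingwebhookconfiguration \
  validation.webhook.karpenter.sh \
  validation.webhook.config.karpenter.sh \
  validation.webhook.karpenter.k8s.aws
kubectl get mutatingwebhookconfiguration defaulting.webhook.karpenter.k8s.aws
```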
Reproduction Steps (Please include YAML):
Values.yaml for ArgoCD:
```yaml
fullnameOverride: "karpenter"
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::XXXXXXXX:role/iamservice-role
settings:
  clusterName: cluster1
  interruptionQueue: cluster1-karpenter-interruptions
  featureGates:
    spotToSpotConsolidation: true
podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/path: /metrics
  prometheus.io/port: "8000"
webhook:
  enabled: true
```
I also use the following kustomization patches (as Karpenter is deployed in the karpenterns namespace):
```yaml
patches:
  - path: patches/karpenter/crds/nodepools.json
    target:
      kind: CustomResourceDefinition
      name: nodepools.karpenter.sh
  - path: patches/karpenter/crds/ec2nodeclasses.json
    target:
      kind: CustomResourceDefinition
      name: ec2nodeclasses.karpenter.k8s.aws
  - path: patches/karpenter/crds/nodeclaims.json
    target:
      kind: CustomResourceDefinition
      name: nodeclaims.karpenter.sh
```
cat patches/karpenter/crds/ec2nodeclasses.json
```json
[
  {
    "op": "add",
    "path": "/spec/conversion/webhook/clientConfig/service/namespace",
    "value": "karpenterns"
  }
]
```
cat patches/karpenter/crds/nodeclaims.json
```json
[
  {
    "op": "add",
    "path": "/spec/conversion/webhook/clientConfig/service/namespace",
    "value": "karpenterns"
  }
]
```
cat patches/karpenter/crds/nodepools.json
```json
[
  {
    "op": "add",
    "path": "/spec/conversion/webhook/clientConfig/service/namespace",
    "value": "karpenterns"
  }
]
```
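A quick way to confirm the patches landed as intended, i.e. that the conversion webhook on each CRD now points at the karpenterns namespace (jsonpath fields are standard; CRD names taken from the patches above):
```shell
for crd in nodepools.karpenter.sh nodeclaims.karpenter.sh ec2nodeclasses.karpenter.k8s.aws; do
  # Prints "<crd name>: <conversion webhook service namespace>"
  kubectl get crd "$crd" \
    -o jsonpath='{.metadata.name}{": "}{.spec.conversion.webhook.clientConfig.service.namespace}{"\n"}'
done
```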
Versions:
- Chart Version: 1.0.4
- Kubernetes Version (kubectl version): Server Version: v1.28.12-eks-a18cd3a
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Hello!
I experienced the exact same issue when upgrading from 0.37.3 to 1.0.5. The only difference in my case is that I had to wait for 1.0.5, which enables the migration to v1 with the webhooks (it is not clear to me which webhooks we are talking about; the ones already listed by @AndrzejWisniewski, I guess), since ArgoCD is currently preventing the deletion of the webhooks. However, even after removing them manually and restarting the deployment after the upgrade, we saw the same errors.
After this we can no longer provision any new nodes. However, the rollback procedure to 0.37.3 works pretty well, as we have tried upgrading to different 1.0.x versions on our dev cluster.
We are working on EKS 1.30 and our current deployment of Karpenter is done via an ArgoCD App:
Chart.yaml
```yaml
apiVersion: v2
name: karpenter
description: A Helm chart for installing Karpenter
type: application
version: 0.3.0
appVersion: "0.37.3"
dependencies:
  - name: karpenter
    version: 0.37.3
    repository: oci://private-registry/karpenter
```
values.yaml
```yaml
karpenter:
  webhook:
    enabled: true
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "8000"
...
```
I would be glad to give more details if required.
Thanks !
Seeing the exact same error from an upgrade last night from v0.37.3 to v1.0.5 on K8s v1.28 (AWS EKS).
Our deployment path uses "helm template" to generate lightly templated resource files (just name prefixes, IRSA annotations, etc.) that our own simple manual deployment tool (Kontemplate) fills in and replaces. I am digging in this morning and found this thread.
Our values.yaml is just:
```yaml
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::{{.aws_account}}:role/{{.aws_role_path}}/{{.workspace}}-irsa-karpenter-v1beta1
settings:
  clusterName: x{x{.workspace}}-telemetry
  interruptionQueue: x{x{.workspace}}-telemetry-karpenter
  # logLevel: debug
```
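For context, the render step is roughly the following (registry URL, namespace, and output path below are illustrative assumptions, not our exact invocation):
```shell
# Render the chart locally; Kontemplate later fills in the remaining placeholders
helm template karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version 1.0.5 \
  --namespace kube-system \
  --values helm/helm-values.yaml \
  > karpenter-rendered.yaml
```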
The errors start with a TLS handshake error and then a collection of reconciler errors:
{"level":"ERROR","time":"2024-10-02T15:08:28.477Z","logger":"webhook","message":"http: TLS handshake error from 10.95.167.138:34812: EOF\n","commit":"652e6aa"}
{"level":"ERROR","time":"2024-10-02T15:08:28.482Z","logger":"controller","message":"Reconciler error","commit":"652e6aa","controller":"nodepool.readiness","controllerGroup":"karpenter.sh","controllerKind":"NodePool","NodePool":{"name":"generic-worker-autoscale"},"namespace":"","name":"generic-worker-autoscale","reconcileID":"bf399b7e-0f49-48c2-b3e7-e8be4709580d","error":"Internal error occurred: failed calling webhook \"validation.webhook.karpenter.sh\": failed to call webhook: the server rejected our request for an unknown reason"}
{"level":"ERROR","time":"2024-10-02T15:08:28.484Z","logger":"controller","message":"Reconciler error","commit":"652e6aa","controller":"migration.resource.nodepool","controllerGroup":"karpenter.sh","controllerKind":"NodePool","NodePool":{"name":"generic-worker-autoscale"},"namespace":"","name":"generic-worker-autoscale","reconcileID":"c5d363c4-ed00-4d7b-b638-593215a89bdf","error":"adding karpenter.sh/stored-version-migrated annotation to karpenter.sh/v1, Kind=NodePool, Internal error occurred: failed calling webhook \"validation.webhook.karpenter.sh\": failed to call webhook: the server rejected our request for an unknown reason"}
{"level":"ERROR","time":"2024-10-02T15:08:28.844Z","logger":"controller","message":"Reconciler error","commit":"652e6aa","controller":"migration.resource.ec2nodeclass","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"generic-worker-autoscale"},"namespace":"","name":"generic-worker-autoscale","reconcileID":"2212f030-3b93-4234-9f7a-77e9b6f5a24f","error":"adding karpenter.sh/stored-version-migrated annotation to karpenter.k8s.aws/v1, Kind=EC2NodeClass, Internal error occurred: failed calling webhook \"defaulting.webhook.karpenter.k8s.aws\": failed to call webhook: the server rejected our request for an unknown reason"}
{"level":"ERROR","time":"2024-10-02T15:08:28.977Z","logger":"controller","message":"Reconciler error","commit":"652e6aa","controller":"nodepool.counter","controllerGroup":"karpenter.sh","controllerKind":"NodePool","NodePool":{"name":"generic-worker-autoscale"},"namespace":"","name":"generic-worker-autoscale","reconcileID":"f9ad0b14-cf7d-4bfb-a922-515fe3b3432e","error":"Internal error occurred: failed calling webhook \"validation.webhook.karpenter.sh\": failed to call webhook: the server rejected our request for an unknown reason"}
{"level":"ERROR","time":"2024-10-02T15:08:33.277Z","logger":"controller","message":"Reconciler error","commit":"652e6aa","controller":"nodeclass.hash","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"generic-worker-autoscale"},"namespace":"","name":"generic-worker-autoscale","reconcileID":"2bf04eae-46d4-4e3d-bc58-f245d40a4d05","error":"Internal error occurred: failed calling webhook \"defaulting.webhook.karpenter.k8s.aws\": failed to call webhook: the server rejected our request for an unknown reason"}
{"level":"ERROR","time":"2024-10-02T15:08:43.801Z","logger":"controller","message":"Reconciler error","commit":"652e6aa","controller":"nodeclass.status","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"generic-worker-autoscale"},"namespace":"","name":"generic-worker-autoscale","reconcileID":"19cc44d4-18a4-4898-8a33-5cb3405543d8","error":"Internal error occurred: failed calling webhook \"defaulting.webhook.karpenter.k8s.aws\": failed to call webhook: the server rejected our request for an unknown reason"}
Here's what my NodePool looks like on that cluster. I need to dig through the docs to see if the annotations will give me a clue. The NodePool looks like it is already karpenter.sh/v1. Looking at my other clusters that are still on v0.37.3 (upgraded just before .4 dropped), they all show karpenter.sh/v1 as well and look identical to the one below from my Karpenter v1.0.5 on EKS v1.28 environment. (A shorter jsonpath check follows the describe output.)
$ k describe nodepools.karpenter.sh
Name: generic-worker-autoscale
Namespace:
Labels: <none>
Annotations: compatibility.karpenter.sh/v1beta1-kubelet-conversion: {"clusterDNS":["169.254.20.10"]}
compatibility.karpenter.sh/v1beta1-nodeclass-reference: {"name":"generic-worker-autoscale"}
karpenter.sh/nodepool-hash: 17332112437454130997
karpenter.sh/nodepool-hash-version: v2
API Version: karpenter.sh/v1
Kind: NodePool
Metadata:
Creation Timestamp: 2024-09-13T23:40:33Z
Generation: 1
Resource Version: 4623212
UID: 22c7afe6-b1e2-4c45-bfed-cf7099378f39
Spec:
Disruption:
Budgets:
Nodes: 10%
Consolidate After: 5m0s
Consolidation Policy: WhenEmpty
Limits:
Cpu: 100
Template:
Metadata:
Labels:
Dedicated: generic-worker-autoscale
Spec:
Expire After: Never
Node Class Ref:
Group: karpenter.k8s.aws
Kind: EC2NodeClass
Name: generic-worker-autoscale
Requirements:
Key: karpenter.sh/capacity-type
Operator: In
Values:
spot
Key: kubernetes.io/arch
Operator: In
Values:
arm64
amd64
Key: karpenter.k8s.aws/instance-cpu
Operator: Lt
Values:
129
Key: karpenter.k8s.aws/instance-cpu
Operator: Gt
Values:
3
Key: karpenter.k8s.aws/instance-category
Operator: NotIn
Values:
t
a
g
Key: karpenter.k8s.aws/instance-family
Operator: NotIn
Values:
m1
m2
m3
m4
c1
c2
c3
c4
Key: kubernetes.io/os
Operator: In
Values:
linux
Startup Taints:
Effect: NoSchedule
Key: di.joby.aero/wait-on-node-critical
Value: true
Taints:
Effect: NoSchedule
Key: dedicated
Value: generic-worker-autoscale
Status:
Events: <none>
$
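A shorter way to spot-check the same fields, plus the stored versions recorded on the CRD, which is what the stored-version-migrated errors above refer to (jsonpath expressions are an assumption on my part; names come from the describe output):
```shell
# Version the object is served as, and the migration-related annotations
kubectl get nodepools.karpenter.sh generic-worker-autoscale \
  -o jsonpath='{.apiVersion}{"\n"}{.metadata.annotations}{"\n"}'
# Versions actually recorded as stored on the CRD
kubectl get crd nodepools.karpenter.sh -o jsonpath='{.status.storedVersions}{"\n"}'
```
Note that kubectl shows the version the API server serves (karpenter.sh/v1 here), while .status.storedVersions on the CRD reflects what is actually persisted.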
Has anyone found a fix so far other than rolling back to a previously working version?
I couldn't even roll back cleanly. The finalizer on the CRDs was hanging talking to /termination and, since the webhooks were toast, never completed.
I applied my 0.37.3 manifests back over the top and that worked. I also cleanly uninstalled 0.37.3 and then installed 1.0.5 on the cluster and that worked, but that is not what I want to do for production. I kinda flailed around whacking on things in the deployment, so YMMV, but it appeared to be a certificate/CA issue on the 1.0.5 webhook. I saw some errors about X.509 not being able to determine the CA, but didn't capture those details though.
Sorry for the delayed response here, I suspect this may be an issue we're already aware of and getting a fix out for but I have a few clarifying questions:
I can confirm that:
- validation.webhook.karpenter.sh
- validation.webhook.config.karpenter.sh
- defaulting.webhook.karpenter.k8s.aws
- validation.webhook.karpenter.k8s.aws
are removed during 1.0.4 deployment.
@AndrzejWisniewski how did you go about confirming this? Was this done by checking that the resources were not included in the updated helm deployment, or were you able to confirm that they were removed from the cluster? There's some more context here, but because knative adds owner references to the objects, Argo won't prune the objects even if it was the original creator.
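For anyone checking their own cluster, the owner reference in question can be inspected with something like this (assuming the webhook configuration still exists):
```shell
# Shows the knative-added ownerReferences that prevent Argo from pruning the object
kubectl get validatingwebhookconfiguration validation.webhook.karpenter.sh \
  -o jsonpath='{.metadata.ownerReferences}{"\n"}'
```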
Though even by removing them manually and restarting the deployment after the upgrade we saw the same errors.
@laserpedro This is extremely surprising to me. Are you able to share what you did to remove the webhooks, and what errors you continued to see?
I suspect this may be an issue we're already aware of and getting a fix out for
I suspected as much considering some recent commits I've seen on your main branch, but I'd really like more transparency on what exactly is going on here. Is there maybe a link to a GitHub issue for the problem you're aware of?
Truth be told we've spent the last two weeks trying to upgrade to karpenter 1.x and we've encountered problem after problem after problem.
Following up on my previous post, I've instructed my team to put the upgrade on pause until the fix for whatever issue it is that you're already aware of is released.
Though even by removing them manually and restarting the deployment after the upgrade we saw the same errors.
@laserpedro This is extremely surprising to me. Are you able to share what you did to remove the webhooks, and what errors you continued to see?
@jmdeal thank you for looking into this issue.
This morning I tried to upgrade again to reproduce the behavior I witnessed:
- upgrade the Helm chart version from 0.37.3 to 1.0.5
- ArgoCD syncs the Karpenter application
- from the ArgoCD UI perspective, the validating and mutating webhooks are deleted and the operator is restarted
- CORRECTION: my apologies @jmdeal, the sequence of errors is a bit different; with my repeated upgrade attempts I mixed up the errors. After the controller restarts, it displays the error:
failed calling webhook "defaulting.webhook.karpenter.k8s.aws": failed to call webhook: the server rejected our request for an unknown reason
Then I manually delete the validating and mutating webhooks with kubectl delete and restart the operator, roughly as follows:
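(Webhook names are the ones listed at the top of the issue; the namespace and deployment name below are assumptions based on a default install.)
```shell
kubectl delete validatingwebhookconfiguration \
  validation.webhook.karpenter.sh \
  validation.webhook.config.karpenter.sh \
  validation.webhook.karpenter.k8s.aws
kubectl delete mutatingwebhookconfiguration defaulting.webhook.karpenter.k8s.aws
# Restart the controller so it starts without the stale webhook configurations
kubectl -n karpenter rollout restart deployment/karpenter
```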
I am now seeing this in the controller logs:
{"level":"ERROR","time":"2024-10-04T05:22:08.702Z","logger":"webhook","message":"http: TLS handshake error from 10.247.80.103:50080: read tcp 100.67.7.198:8443->10.247.80.103:50080: read: connection reset by peer\n","commit":"652e6aa"}
(our node security group is properly configured to accept TCP from the control plane on 8443)
I guess this time it is the conversion webhook that cannot establish the connection with the control plane (10.247.x.x matches my control plane CIDR range, 100.67.x.x matches my data plane CIDR range). Please correct me if my understanding is not right here.
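To double-check which endpoint the control plane dials for conversion, something like the following helps (the service name and namespace assume the default chart naming; adjust for fullnameOverride and your namespace):
```shell
# Conversion webhook client config on the CRD (service, namespace, port, caBundle)
kubectl get crd nodepools.karpenter.sh \
  -o jsonpath='{.spec.conversion.webhook.clientConfig}{"\n"}'
# Ports exposed by the Karpenter service the API server connects to
kubectl -n kube-system get svc karpenter -o jsonpath='{.spec.ports}{"\n"}'
```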
(I am also seeing a sync error from ArgoCD on the CPU limits, where "1000" is cast as 1k, but that is more of an ArgoCD issue I guess; let's solve one problem at a time.)
We just released new patch versions of the pre-1.0 minor versions that fix this issue so that these configuration resources aren't leaked. Please upgrade to one of the following versions prior to going to 1.0.x, since they remove the ownerReference that causes Argo to leak the resources and leads to the failure on upgrade (a rough helm sketch follows the list):
- v0.33.10
- v0.34.11
- v0.35.10
- v0.36.7
- v0.37.5
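A minimal sketch of that two-step path, assuming the public chart registry, the kube-system namespace, and an existing values.yaml (substitute your own chart source, namespace, and values; CRDs are managed separately, e.g. via the karpenter-crd chart):
```shell
# Step 1: move to a patched pre-1.0 release that cleans up the leaked ownerReferences
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace kube-system --version 0.37.5 -f values.yaml --wait

# Step 2: once the webhook configurations are confirmed gone, upgrade to 1.0.x
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace kube-system --version 1.0.5 -f values.yaml --wait
```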
Also, this seems related to points discussed in #6982 and #6847. See this comment for a description of why this occurs.
Then I manually delete the validating and mutating webhooks with kubectl delete and restart the operator:
I am now seeing this in the controller logs: {"level":"ERROR","time":"2024-10-04T05:22:08.702Z","logger":"webhook","message":"http: TLS handshake error from 10.247.80.103:50080: read tcp 100.67.7.198:8443->10.247.80.103:50080: read: connection reset by peer\n","commit":"652e6aa"}
(our node security group is properly configured to accept TCP from the control plane on 8443)
This appears to be this issue: #6898. This is a spurious error that happens generally in k8s when using webhooks, you should be able to safely ignore it. There's more context in an older issue: kubernetes-sigs/karpenter#718. If you think there's an issue with Karpenter's operations that are related to this error, please leave an update in #6898.
I'm going to close this issue out now that the releases with the fixes have been made.