Startup probe failed: HTTP probe failed with statuscode: 500
robcharlwood opened this issue · 9 comments
Hi
We are seeing a problem with the latest version of rancher-webhook (0.3.5) when running alongside the latest Rancher (2.7.6). In both the Rancher HA cluster and the imported K3s and GKE downstream clusters, the webhook pod logs a warning about startup probe checks failing with status code 500.
Events:
Type     Reason     Age               From               Message
----     ------     ----              ----               -------
Normal   Scheduled  15s               default-scheduler  Successfully assigned cattle-system/rancher-webhook-998454b77-nvch5 to <redacted>
Normal   Pulled     14s               kubelet            Container image "rancher/rancher-webhook:v0.3.5" already present on machine
Normal   Created    14s               kubelet            Created container rancher-webhook
Normal   Started    14s               kubelet            Started container rancher-webhook
Warning  Unhealthy  5s (x2 over 10s)  kubelet            Startup probe failed: HTTP probe failed with statuscode: 500
If left for long enough, it eventually starts failing with a liveness probe error:
Events:
Type     Reason     Age                 From     Message
----     ------     ----                ----     -------
Warning  Unhealthy  41m (x52 over 19h)  kubelet  Liveness probe failed: Get "https://XXX.XXX.XXX.XXX:9443/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
This is only ever generated as a warning and the pod never actually becomes unhealthy. The pod's logs also give nothing useful:
time="2023-09-13T10:22:52Z" level=info msg="Rancher-webhook version v0.3.5 (2e89c65) is starting"
time="2023-09-13T10:22:52Z" level=info msg="Active TLS secret cattle-system/cattle-webhook-tls (ver=5511970) (count 1): map[listener.cattle.io/cn-rancher-webhook.cattle-system.svc:rancher-webhook.cattle-system.svc listener.cattle.io/fingerprint:SHA1=XXXXXXXXXXXXXXXXXXXXXXXXXXXX]"
time="2023-09-13T10:22:52Z" level=info msg="Listening on :9443"
time="2023-09-13T10:22:52Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRole controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting management.cattle.io/v3, Kind=Cluster controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting management.cattle.io/v3, Kind=ClusterRoleTemplateBinding controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting management.cattle.io/v3, Kind=GlobalRole controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting /v1, Kind=Secret controller"
time="2023-09-13T10:22:52Z" level=info msg="Sleeping for 15 seconds then applying webhook config"
time="2023-09-13T10:22:52Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=RoleBinding controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting management.cattle.io/v3, Kind=PodSecurityAdmissionConfigurationTemplate controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting provisioning.cattle.io/v1, Kind=Cluster controller"
time="2023-09-13T10:22:53Z" level=info msg="Starting management.cattle.io/v3, Kind=ProjectRoleTemplateBinding controller"
time="2023-09-13T10:22:53Z" level=info msg="Starting apiregistration.k8s.io/v1, Kind=APIService controller"
time="2023-09-13T10:22:53Z" level=info msg="Starting apiextensions.k8s.io/v1, Kind=CustomResourceDefinition controller"
time="2023-09-13T10:22:53Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=Role controller"
time="2023-09-13T10:22:53Z" level=info msg="Starting management.cattle.io/v3, Kind=RoleTemplate controller"
time="2023-09-13T10:22:53Z" level=info msg="Updating TLS secret for cattle-system/cattle-webhook-tls (count: 1): map[listener.cattle.io/cn-rancher-webhook.cattle-system.svc:rancher-webhook.cattle-system.svc listener.cattle.io/fingerprint:SHA1=XXXXXXXXXXXXXXXXXXXXXXXXXXXX]"
This Rancher instance is deployed as follows:
- Private GKE cluster running in Google Cloud, with etcd encryption using a custom KMS key
- The cluster is running Kubernetes 1.26.4-gke.500
- We allow GKE control plane ingress to the webhook on port 9443/TCP in our firewall rules, as per the docs
Any help or advice on this issue would be appreciated.
Many thanks!
I don't have an immediate solution to your problem, but it looks like the root cause is that the kube-apiserver cannot communicate with the webhook container running on the cluster.
To verify that the problem is not with the webhook itself, you can check that the webhook configuration was created successfully:
kubectl get validatingwebhookconfigurations rancher.cattle.io -o yaml
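If the configuration exists, it may also be worth confirming that the service and endpoints the apiserver calls actually resolve to the pod. A minimal sketch, assuming the default chart install in cattle-system:

```shell
# Check that the webhook service exists and has a ready endpoint
# (resource names assume the default rancher-webhook chart install).
kubectl -n cattle-system get service rancher-webhook
kubectl -n cattle-system get endpoints rancher-webhook
```

An empty ENDPOINTS column would mean the service selector does not match the pod, which would make every probe and admission call fail.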
Your other option, which is available on Rancher:v2.7-head but has not been released yet, would be to have the webhook run on port 443:
rancher/rancher#41142 (comment)
@KevinJoiner Thanks! I will check this and get back to you.
@KevinJoiner - So I ran the suggested command and YAML was returned successfully. I can't see anything problematic in the output. Is there anything specific I should be looking for?
@robcharlwood No. If the resource exists and the webhook is not logging any errors, we can be more confident that the problem is the connection between the kube-apiserver and the rancher-webhook pod.
- I would double-check the steps for adding the firewall rule to make sure it is correctly configured, since the symptoms seem to match https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters#api_request_that_triggers_admission_webhook_timing_out
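For reference, the rule from the linked GKE doc amounts to something like the following sketch; the rule name, network, control-plane CIDR, and node tag are placeholders you would substitute from your own environment:

```shell
# Allow the GKE control plane to reach the webhook on TCP 9443.
# NETWORK, CONTROL_PLANE_CIDR, and NODE_TAG are placeholders for your environment.
gcloud compute firewall-rules create allow-rancher-webhook \
  --network NETWORK \
  --direction INGRESS \
  --source-ranges CONTROL_PLANE_CIDR \
  --target-tags NODE_TAG \
  --allow tcp:9443
```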
- You can try editing the webhook deployment and removing the startupProbe and livenessProbe to see if things start to work. I don't expect this to fix the problem, since other requests will most likely time out when you try to create a RoleTemplate, but if it does work, we might have a bug on our side.
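If you want to try that, a minimal sketch of removing both probes with a JSON patch (assuming rancher-webhook is the first container in the pod spec):

```shell
# Remove the startupProbe and livenessProbe from the webhook deployment.
# Assumes rancher-webhook is the first (index 0) container in the pod spec.
kubectl -n cattle-system patch deployment rancher-webhook --type=json -p='[
  {"op": "remove", "path": "/spec/template/spec/containers/0/startupProbe"},
  {"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}
]'
```

Note that a chart upgrade may revert this change, so it is only useful as a diagnostic.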
@KevinJoiner Thanks! I will investigate and report back!
We are experiencing the same issue.
- Rancher v2.7.6 deployed on k3s 1.25.10
- downstream cluster: vanilla Kubernetes 1.28.2 + Cilium (we also tried Calico)
- Firewall rules allow any communication between nodes (trusted)
Adding some extra info:
- We reconfigured and tried to import a new mini Kubernetes cluster (1 master + 1 worker) multiple times with different Kubernetes versions (1.28.2, 1.25.12, 1.24.4). All tests failed.
- The two machines we ran our tests on had already been successfully imported previously (Rancher 2.5 and Kubernetes 1.24.4).
- We created a custom RoleTemplate and assigned it to a user on the downstream cluster and it seemed to work without any issue.
For the time being, we will try removing the startupProbe and livenessProbe.
I have the very same issue:
Rancher UI: 2.7.9
RKE version: v1.5.1
K8s: v1.25.16
kubectl describe pod -n cattle-system rancher-webhook-7879bb6c5-vb7ss
Events:
Type     Reason     Age                     From     Message
----     ------     ----                    ----     -------
Warning  Unhealthy  85s (x28071 over 2d1h)  kubelet  Startup probe failed: HTTP probe failed with statuscode: 500
The problem started after upgrading from the previous version.
I have the same issue:
Rancher chart: rancher-2.8.1
Rancher webhook chart: rancher-webhook-103.0.1+up0.4.2
Kubernetes: v1.27.13
When I try to hit the /healthz endpoint, I get this log message:
[-]Config Applied failed: reason withheld
healthz check failed
I'm struggling with the "reason withheld" error because I can't find the root cause.
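For what it's worth, the "reason withheld" text appears to come from the standard Kubernetes healthz handler, which deliberately withholds failure details from the HTTP response; the underlying error is normally written to the server's own log, so the webhook pod logs are the place to look. If you want to query the endpoint yourself, a sketch via a port-forward (the self-signed certificate requires -k):

```shell
# Forward the webhook's HTTPS port locally, then query the health endpoint.
kubectl -n cattle-system port-forward deploy/rancher-webhook 9443:9443 &
sleep 2  # give the port-forward a moment to establish
curl -ks https://localhost:9443/healthz
kill %1  # stop the background port-forward
```

Given the check name "Config Applied", the failing step is presumably the webhook-configuration apply that the pod logs as "Sleeping for 15 seconds then applying webhook config".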