rancher/webhook

Startup probe failed: HTTP probe failed with statuscode: 500

robcharlwood opened this issue · 9 comments

Hi

We are seeing a problem with the latest version of the rancher-webhook (0.3.5) when running alongside the latest rancher (2.7.6). In both the Rancher HA cluster and imported K3S and GKE downstream clusters, the webhook pod has a warning about startup probe checks failing with status code 500.

Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  15s               default-scheduler  Successfully assigned cattle-system/rancher-webhook-998454b77-nvch5 to <redacted>
  Normal   Pulled     14s               kubelet            Container image "rancher/rancher-webhook:v0.3.5" already present on machine
  Normal   Created    14s               kubelet            Created container rancher-webhook
  Normal   Started    14s               kubelet            Started container rancher-webhook
  Warning  Unhealthy  5s (x2 over 10s)  kubelet            Startup probe failed: HTTP probe failed with statuscode: 500

If left for long enough, it eventually starts failing with a liveness probe error:

Events:
  Type     Reason     Age                 From     Message
  ----     ------     ----                ----     -------
  Warning  Unhealthy  41m (x52 over 19h)  kubelet  Liveness probe failed: Get "https://XXX.XXX.XXX.XXX:9443/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

This is only ever generated as a warning and the pod itself never becomes unhealthy. The pod also does not produce any useful logs:

time="2023-09-13T10:22:52Z" level=info msg="Rancher-webhook version v0.3.5 (2e89c65) is starting"
time="2023-09-13T10:22:52Z" level=info msg="Active TLS secret cattle-system/cattle-webhook-tls (ver=5511970) (count 1): map[listener.cattle.io/cn-rancher-webhook.cattle-system.svc:rancher-webhook.cattle-system.svc listener.cattle.io/fingerprint:SHA1=XXXXXXXXXXXXXXXXXXXXXXXXXXXX]"
time="2023-09-13T10:22:52Z" level=info msg="Listening on :9443"
time="2023-09-13T10:22:52Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRole controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting management.cattle.io/v3, Kind=Cluster controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting management.cattle.io/v3, Kind=ClusterRoleTemplateBinding controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting management.cattle.io/v3, Kind=GlobalRole controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting /v1, Kind=Secret controller"
time="2023-09-13T10:22:52Z" level=info msg="Sleeping for 15 seconds then applying webhook config"
time="2023-09-13T10:22:52Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=RoleBinding controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting management.cattle.io/v3, Kind=PodSecurityAdmissionConfigurationTemplate controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting provisioning.cattle.io/v1, Kind=Cluster controller"
time="2023-09-13T10:22:53Z" level=info msg="Starting management.cattle.io/v3, Kind=ProjectRoleTemplateBinding controller"
time="2023-09-13T10:22:53Z" level=info msg="Starting apiregistration.k8s.io/v1, Kind=APIService controller"
time="2023-09-13T10:22:53Z" level=info msg="Starting apiextensions.k8s.io/v1, Kind=CustomResourceDefinition controller"
time="2023-09-13T10:22:53Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=Role controller"
time="2023-09-13T10:22:53Z" level=info msg="Starting management.cattle.io/v3, Kind=RoleTemplate controller"
time="2023-09-13T10:22:53Z" level=info msg="Updating TLS secret for cattle-system/cattle-webhook-tls (count: 1): map[listener.cattle.io/cn-rancher-webhook.cattle-system.svc:rancher-webhook.cattle-system.svc listener.cattle.io/fingerprint:SHA1=XXXXXXXXXXXXXXXXXXXXXXXXXXXX]"

This rancher is deployed in the following manner:

  • Private GKE cluster running in Google Cloud with etcd encryption using a custom KMS key
  • Cluster is running 1.26.4-gke.500 of Kubernetes
  • We allow GKE control plane ingress to the webhook on port 9443/TCP in our firewall rules, as per the docs
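
For reference, the rule we added looks roughly like this (cluster, network, and tag names are placeholders for our environment):

```shell
# Find the control-plane CIDR of the private cluster
gcloud container clusters describe my-cluster --region europe-west2 \
  --format 'value(privateClusterConfig.masterIpv4CidrBlock)'

# Allow the control plane to reach the webhook pods on TCP 9443
gcloud compute firewall-rules create allow-master-to-webhook \
  --network my-network \
  --direction INGRESS \
  --source-ranges <control-plane-cidr> \
  --target-tags <node-tag> \
  --allow tcp:9443
```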

Any help or advice on this issue would be appreciated.

Many thanks!

I don't have any immediate solutions to your problem, but it looks like the root cause is that the kube-apiserver cannot communicate with the container running on the cluster.

To verify the problem is not with the webhook, you can check that the webhook configuration was created successfully:
kubectl get validatingwebhookconfigurations rancher.cattle.io -o yaml
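
You can also hit the probe endpoint directly to see the 500 for yourself (a sketch; assumes the pod listens on 9443 as shown in the logs above):

```shell
# Forward the webhook's HTTPS port to localhost
kubectl -n cattle-system port-forward deploy/rancher-webhook 9443:9443 &

# Hit the same endpoint the probes use; -k because the certificate is
# issued for the in-cluster service name, not localhost
curl -k https://localhost:9443/healthz
```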

Your other option, which is available on Rancher v2.7-head but has not been released yet, would be to have the webhook run on port 443:
rancher/rancher#41142 (comment)

@KevinJoiner Thanks! I will check this and get back to you.

@KevinJoiner - So I ran the suggested command and YAML was returned successfully. I can't see anything problematic in the output. Is there anything specific I should be looking for?

@robcharlwood No, if the resource exists and the webhook is not logging any errors we can have higher confidence that the problem is with the connection between the kube-apiserver and the rancher-webhook pod.

  1. I would double-check the steps for adding the firewall rule to make sure it is correctly configured, since the symptoms seem to match https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters#api_request_that_triggers_admission_webhook_timing_out

  2. You can try to edit the deployment of the webhook, remove the startupProbe and livenessProbe, and see if things start to work. I don't expect this to fix the problem, since other requests will most likely time out when you try to create a RoleTemplate, but if it does work, we might have a bug on our side.
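
A sketch of step 2 as a one-off patch (assumes the webhook container is the first in the pod spec; edit the deployment by hand if the layout differs):

```shell
# Remove both probes from the rancher-webhook deployment via JSON Patch
kubectl -n cattle-system patch deployment rancher-webhook --type json -p '[
  {"op": "remove", "path": "/spec/template/spec/containers/0/startupProbe"},
  {"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}
]'
```

Note that a chart upgrade will restore the probes, so this is only useful as a diagnostic.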

@KevinJoiner Thanks! I will investigate and report back!

We are experiencing the same issue.

  • Rancher v2.7.6 deployed on k3s 1.25.10
  • downstream cluster k8s vanilla + cilium (we also tried with calico): 1.28.2

Firewall rules allow any communication between nodes (trusted)

Adding some extra info:

  • We reconfigured and tried to import a new mini k8s cluster (1M + 1W) multiple times with different k8s versions (1.28.2, 1.25.12, 1.24.4). All tests failed.
  • The two machines we ran our test on had already been successfully imported previously (Rancher 2.5 and K8S 1.24.4).
  • We created a custom RoleTemplate and assigned it to a user on the downstream cluster and it seemed to work without any issue.

For the time being we will try removing the startupProbe and livenessProbe.

I have the very same issue:

Rancher UI: 2.7.9
RKE version: v1.5.1
K8s: v1.25.16

kubectl describe pod -n cattle-system rancher-webhook-7879bb6c5-vb7ss

Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  85s (x28071 over 2d1h)  kubelet  Startup probe failed: HTTP probe failed with statuscode: 500

The problem started after upgrading from the previous version.

I have the same issue:

Rancher chart: rancher-2.8.1
Rancher webhook chart: rancher-webhook-103.0.1+up0.4.2
Kubernetes: v1.27.13

When I try to hit the /healthz endpoint, I get this log message:

[-]Config Applied failed: reason withheld
healthz check failed

I'm struggling with the "reason withheld" error because I can't find out what the root cause is.