intel/intel-device-plugins-for-kubernetes

intel device plugins gpu: failed to call webhook, context deadline exceeded

Closed this issue · 10 comments

Llyr95 commented

Describe the support request
I am trying to install the Intel device plugins GPU Helm chart after installing the operator Helm chart. It fails with:

Helm install failed for release system/intel-device-plugin-gpu with chart intel-device-plugins-gpu@0.28.0: 1 error occurred: * Internal error occurred: failed calling webhook "mgpudeviceplugin.kb.io": failed to call webhook: Post "https://inteldeviceplugins-webhook-service.system.svc:443/mutate-deviceplugin-intel-com-v1-gpudeviceplugin?timeout=10s": context deadline exceeded

The Helm chart is installed as a HelmRelease via Flux.

Thanks for your help.

System (please complete the following information if applicable):

  • OS version: Talos v1.6.1
  • Device plugins version: v0.28.0
  • Hardware info: 3 HP ProDesk Mini PCs (Gen 4)

Hi @Llyr95

The webhook takes some time to come up, so if you try to install the CR "too soon" it may fail. Or maybe the webhook part of the operator is misbehaving.

Can you check if the controller-manager pod is fully up and running? If it isn't, can you share the logs:
kubectl logs -n <namespace> inteldeviceplugins-controller-manager-something-anything -c kube-rbac-proxy

Llyr95 commented

Hi @tkatila,

Thank you for answering

Here are the logs you asked for:

I0128 16:55:26.266275       1 flags.go:64] FLAG: --add-dir-header="false"
I0128 16:55:26.266322       1 flags.go:64] FLAG: --allow-paths="[]"
I0128 16:55:26.266328       1 flags.go:64] FLAG: --alsologtostderr="false"
I0128 16:55:26.266331       1 flags.go:64] FLAG: --auth-header-fields-enabled="false"
I0128 16:55:26.266335       1 flags.go:64] FLAG: --auth-header-groups-field-name="x-remote-groups"
I0128 16:55:26.266340       1 flags.go:64] FLAG: --auth-header-groups-field-separator="|"
I0128 16:55:26.266343       1 flags.go:64] FLAG: --auth-header-user-field-name="x-remote-user"
I0128 16:55:26.266346       1 flags.go:64] FLAG: --auth-token-audiences="[]"
I0128 16:55:26.266350       1 flags.go:64] FLAG: --client-ca-file=""
I0128 16:55:26.266353       1 flags.go:64] FLAG: --config-file=""
I0128 16:55:26.266355       1 flags.go:64] FLAG: --help="false"
I0128 16:55:26.266359       1 flags.go:64] FLAG: --ignore-paths="[]"
I0128 16:55:26.266362       1 flags.go:64] FLAG: --insecure-listen-address=""
I0128 16:55:26.266365       1 flags.go:64] FLAG: --kubeconfig=""
I0128 16:55:26.266368       1 flags.go:64] FLAG: --log-backtrace-at=":0"
I0128 16:55:26.266373       1 flags.go:64] FLAG: --log-dir=""
I0128 16:55:26.266376       1 flags.go:64] FLAG: --log-file=""
I0128 16:55:26.266379       1 flags.go:64] FLAG: --log-file-max-size="1800"
I0128 16:55:26.266382       1 flags.go:64] FLAG: --log-flush-frequency="5s"
I0128 16:55:26.266385       1 flags.go:64] FLAG: --logtostderr="true"
I0128 16:55:26.266388       1 flags.go:64] FLAG: --oidc-ca-file=""
I0128 16:55:26.266391       1 flags.go:64] FLAG: --oidc-clientID=""
I0128 16:55:26.266394       1 flags.go:64] FLAG: --oidc-groups-claim="groups"
I0128 16:55:26.266397       1 flags.go:64] FLAG: --oidc-groups-prefix=""
I0128 16:55:26.266399       1 flags.go:64] FLAG: --oidc-issuer=""
I0128 16:55:26.266402       1 flags.go:64] FLAG: --oidc-sign-alg="[RS256]"
I0128 16:55:26.266408       1 flags.go:64] FLAG: --oidc-username-claim="email"
I0128 16:55:26.266411       1 flags.go:64] FLAG: --one-output="false"
I0128 16:55:26.266414       1 flags.go:64] FLAG: --proxy-endpoints-port="0"
I0128 16:55:26.266417       1 flags.go:64] FLAG: --secure-listen-address="0.0.0.0:8443"
I0128 16:55:26.266420       1 flags.go:64] FLAG: --skip-headers="false"
I0128 16:55:26.266423       1 flags.go:64] FLAG: --skip-log-headers="false"
I0128 16:55:26.266426       1 flags.go:64] FLAG: --stderrthreshold="2"
I0128 16:55:26.266429       1 flags.go:64] FLAG: --tls-cert-file=""
I0128 16:55:26.266432       1 flags.go:64] FLAG: --tls-cipher-suites="[TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305]"
I0128 16:55:26.266440       1 flags.go:64] FLAG: --tls-min-version="VersionTLS12"
I0128 16:55:26.266444       1 flags.go:64] FLAG: --tls-private-key-file=""
I0128 16:55:26.266446       1 flags.go:64] FLAG: --tls-reload-interval="1m0s"
I0128 16:55:26.266451       1 flags.go:64] FLAG: --upstream="http://127.0.0.1:8080/"
I0128 16:55:26.266454       1 flags.go:64] FLAG: --upstream-ca-file=""
I0128 16:55:26.266457       1 flags.go:64] FLAG: --upstream-client-cert-file=""
I0128 16:55:26.266460       1 flags.go:64] FLAG: --upstream-client-key-file=""
I0128 16:55:26.266463       1 flags.go:64] FLAG: --upstream-force-h2c="false"
I0128 16:55:26.266466       1 flags.go:64] FLAG: --v="10"
I0128 16:55:26.266469       1 flags.go:64] FLAG: --version="false"
I0128 16:55:26.266473       1 flags.go:64] FLAG: --vmodule=""
W0128 16:55:26.266730       1 kube-rbac-proxy.go:152] 
==== Deprecation Warning ======================

Insecure listen address will be removed.
Using --insecure-listen-address won't be possible!

The ability to run kube-rbac-proxy without TLS certificates will be removed.
Not using --tls-cert-file and --tls-private-key-file won't be possible!

For more information, please go to https://github.com/brancz/kube-rbac-proxy/issues/187

===============================================

		
I0128 16:55:26.266757       1 kube-rbac-proxy.go:272] Valid token audiences: 
I0128 16:55:26.266788       1 kube-rbac-proxy.go:363] Generating self signed cert as no cert is provided
I0128 16:55:26.435337       1 kube-rbac-proxy.go:414] Starting TCP socket on 0.0.0.0:8443
I0128 16:55:26.435488       1 kube-rbac-proxy.go:421] Listening securely on 0.0.0.0:8443

Thanks, the logs seem ok.

If you try to re-apply the GPU CR, does it still fail?

Llyr95 commented

I have tried reinstalling the GPU plugin, but I don't understand the part about the CR. From my testing, the custom resource definitions are installed with the operator Helm chart, and I install the GPU plugin Helm chart afterwards.

So how can I create the operator's custom resource definitions before it creates the webhook? Or is there something I don't understand?

I have tried running the daemonset via kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/gpu_plugin/overlays/nfd_labeled_nodes?ref=<RELEASE_VERSION>' from https://intel.github.io/intel-device-plugins-for-kubernetes/cmd/gpu_plugin/README.html#install-with-nfd, and that worked. However, the goal is to install the operator and GPU plugin Helm charts.

You can't really change the creation order: the operator chart creates the CRDs, and the GPU plugin chart creates a CR.

The reason I asked about re-creating it is timing. The operator Helm chart installs the CRDs and the operator Pod, but Helm doesn't (unless asked) wait for the Pods to become available. The webhook in particular takes some time to come up, and if the GPU CR is deployed during that window, it will fail.

With the Helm CLI, if you install the operator and the GPU plugin back to back:
helm install operator intel/intel-device-plugins-operator && helm install gpu intel/intel-device-plugins-gpu --set nodeFeatureRule=true
The second part might fail as the webhook is not yet running.

The fix for this is to tell the Helm CLI to wait for the deployment (--wait):
helm install --wait operator intel/intel-device-plugins-operator && helm install gpu intel/intel-device-plugins-gpu --set nodeFeatureRule=true

I'm not familiar with Flux, so I don't know how it handles this.
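
That said, if Flux's dependsOn works the way I'd expect, something like the following should hold back the GPU HelmRelease until the operator release is Ready, giving you the same ordering as --wait above. This is only a sketch: the release names operator and gpu, the system/flux-system namespaces and the intel HelmRepository are assumptions, not taken from your setup.

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: gpu
  namespace: system
spec:
  interval: 10m
  # Hold reconciliation of this release until the operator HelmRelease reports Ready,
  # so the mutating webhook is up before the GpuDevicePlugin CR gets created.
  dependsOn:
    - name: operator
  chart:
    spec:
      chart: intel-device-plugins-gpu
      version: 0.28.0
      sourceRef:
        kind: HelmRepository
        name: intel
        namespace: flux-system
  values:
    nodeFeatureRule: true

The operator HelmRelease would look the same, just without the dependsOn entry; if I read the Flux docs right, helm-controller also waits for a release's resources to become ready by default (unless disableWait is set), which is the --wait behaviour from above.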

Another thing to try is, when the GPU CR part has failed, wait a few seconds and try to create the GPU CR from the device plugins project:
curl 'https://raw.githubusercontent.com/intel/intel-device-plugins-for-kubernetes/v0.28.0/deployments/operator/samples/deviceplugin_v1_gpudeviceplugin.yaml' | kubectl create -f -

If the creation succeeds, then the underlying issue is timing. If it still fails, it's something related to the environment, which requires more debugging.
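
For reference, the sample CR at that URL is roughly the following; treat the field values here as illustrative only, the linked file has the exact 0.28.0 content.

apiVersion: deviceplugin.intel.com/v1
kind: GpuDevicePlugin
metadata:
  name: gpudeviceplugin-sample
spec:
  # Illustrative values; the linked sample carries the exact image tag and settings.
  image: intel/intel-gpu-plugin:0.28.0
  sharedDevNum: 10
  logLevel: 4
  nodeSelector:
    intel.feature.node.kubernetes.io/gpu: "true"

Creating it only requires the webhook to answer, so it is a good minimal test for the timing theory.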

eero-t commented

I have tried reinstalling the GPU plugin, but I don't understand the part about the CR. From my testing, the custom resource definitions are installed with the operator Helm chart, and I install the GPU plugin Helm chart afterwards.

I'm not sure whether it's relevant here (Tuomas?), but the Helm tool supports only the (initial) CRD install, not CRD upgrades. AFAIK a proper upgrade of changed CRDs would require them to be removed (manually) before using e.g. Helm to install the new ones...

(The Helm project has a lot of tickets about that, and a long document about the corner cases that are the reason why Helm chooses not to support CRD removal/upgrades.)

The CRD install doesn't seem to be the issue. The failure would be different.

Llyr95 commented

Ok so I did some testing

I tried to install the GPU CR manually like @tkatila said, and I got the same error:
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "mgpudeviceplugin.kb.io": failed to call webhook: Post "https://inteldeviceplugins-webhook-service.system.svc:443/mutate-deviceplugin-intel-com-v1-gpudeviceplugin?timeout=10s": context deadline exceeded

After removing everything, I tested helm install --wait operator intel/intel-device-plugins-operator --version=v0.28.0 -n system && helm install gpu intel/intel-device-plugins-gpu --set nodeFeatureRule=true --version=v0.28.0 -n system and again, the same problem.

At one point, I tried your first command to check (helm install operator intel/intel-device-plugins-operator && helm install gpu intel/intel-device-plugins-gpu --set nodeFeatureRule=true) and.... it worked

I didn't understand why, so I figured it was because I had forgotten the version flag (I am on k8s v1.28.9). But that didn't make sense: if a version mismatch were the problem, it would be plugin 0.29.0 on k8s 1.28.9 that failed, not the other way around.

Afterwards, I found out that if I installed the operator in the system namespace, I would get the error described. I still don't know why; it may be something in my configuration that I overlooked, and I will dig deeper into that.

Thank you very much for your help!

Good that you got it working!

I don't understand why 'system' ns would cause the webhook to break. We typically use 'intel' or 'inteldeviceplugins' ns without issues.

As you are using Talos, have you decreased the pod-security for the default namespace? I recall that Talos has quite strict pod-security settings by default that can cause issues with Pods not running. I wouldn't be surprised if there were some network access limitations as well.
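
If pod-security admission turns out to be the blocker, relaxing it is usually a one-label change on the namespace the plugins run in. A sketch only, assuming the Pod Security Standards admission labels and the system namespace from above:

apiVersion: v1
kind: Namespace
metadata:
  name: system
  labels:
    # Pod Security Standards: allow privileged workloads in this namespace,
    # which the device plugin pods typically need for their hostPath mounts.
    pod-security.kubernetes.io/enforce: privileged

The same label can also be applied to an existing namespace with kubectl label.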

In intel/helm-charts#46 we showed that the namespace does not matter. Anyway, closing.