NVIDIA/gpu-operator

Error: failed to create FS watcher: too many open files

EajksEajks opened this issue · 6 comments

Hi.

I have an issue deploying the GPU operator v22.9.0 on a vanilla Kubernetes 1.25.3 cluster running on a Dell PowerEdge R740 server with two NVIDIA A30 cards. The nvidia-device-plugin-daemonset fails with the following error:

2022/11/14 14:20:20 Starting FS watcher.
2022/11/14 14:20:20 Error: failed to create FS watcher: too many open files

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node? Ubuntu 20.04
  • Are you running Kubernetes v1.13+? Kubernetes 1.25.3
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? containerd.io 1.6.9-1
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes? No idea what you're talking about... (see the module check sketch after the ClusterPolicy output below)
  • Did you apply the CRD? (kubectl describe clusterpolicies --all-namespaces)
Name:         cluster-policy
Namespace:
Labels:       app.kubernetes.io/component=gpu-operator
              app.kubernetes.io/managed-by=Helm
Annotations:  meta.helm.sh/release-name: gpu-operator
              meta.helm.sh/release-namespace: gpu-operator
API Version:  nvidia.com/v1
Kind:         ClusterPolicy
Metadata:
  Creation Timestamp:  2022-11-14T14:13:48Z
  Generation:          1
  Managed Fields:
    API Version:  nvidia.com/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:meta.helm.sh/release-name:
          f:meta.helm.sh/release-namespace:
        f:labels:
          .:
          f:app.kubernetes.io/component:
          f:app.kubernetes.io/managed-by:
      f:spec:
        .:
        f:daemonsets:
          .:
          f:priorityClassName:
          f:tolerations:
        f:dcgm:
          .:
          f:enabled:
          f:hostPort:
          f:image:
          f:imagePullPolicy:
          f:repository:
          f:version:
        f:dcgmExporter:
          .:
          f:enabled:
          f:env:
          f:image:
          f:imagePullPolicy:
          f:repository:
          f:serviceMonitor:
            .:
            f:additionalLabels:
            f:enabled:
            f:honorLabels:
            f:interval:
          f:version:
        f:devicePlugin:
          .:
          f:config:
            .:
            f:default:
            f:name:
          f:enabled:
          f:env:
          f:image:
          f:imagePullPolicy:
          f:repository:
          f:version:
        f:driver:
          .:
          f:certConfig:
            .:
            f:name:
          f:enabled:
          f:image:
          f:imagePullPolicy:
          f:kernelModuleConfig:
            .:
            f:name:
          f:licensingConfig:
            .:
            f:configMapName:
            f:nlsEnabled:
          f:manager:
            .:
            f:env:
            f:image:
            f:imagePullPolicy:
            f:repository:
            f:version:
          f:rdma:
            .:
            f:enabled:
            f:useHostMofed:
          f:repoConfig:
            .:
            f:configMapName:
          f:repository:
          f:rollingUpdate:
            .:
            f:maxUnavailable:
          f:version:
          f:virtualTopology:
            .:
            f:config:
        f:gfd:
          .:
          f:enabled:
          f:env:
          f:image:
          f:imagePullPolicy:
          f:repository:
          f:version:
        f:mig:
          .:
          f:strategy:
        f:migManager:
          .:
          f:config:
            .:
            f:name:
          f:enabled:
          f:env:
          f:gpuClientsConfig:
            .:
            f:name:
          f:image:
          f:imagePullPolicy:
          f:repository:
          f:version:
        f:nodeStatusExporter:
          .:
          f:enabled:
          f:image:
          f:imagePullPolicy:
          f:repository:
          f:version:
        f:operator:
          .:
          f:defaultRuntime:
          f:initContainer:
            .:
            f:image:
            f:imagePullPolicy:
            f:repository:
            f:version:
          f:runtimeClass:
        f:psp:
          .:
          f:enabled:
        f:sandboxDevicePlugin:
          .:
          f:enabled:
          f:image:
          f:imagePullPolicy:
          f:repository:
          f:version:
        f:sandboxWorkloads:
          .:
          f:defaultWorkload:
          f:enabled:
        f:toolkit:
          .:
          f:enabled:
          f:image:
          f:imagePullPolicy:
          f:installDir:
          f:repository:
          f:version:
        f:validator:
          .:
          f:image:
          f:imagePullPolicy:
          f:plugin:
            .:
            f:env:
          f:repository:
          f:version:
        f:vfioManager:
          .:
          f:driverManager:
            .:
            f:env:
            f:image:
            f:imagePullPolicy:
            f:repository:
            f:version:
          f:enabled:
          f:image:
          f:imagePullPolicy:
          f:repository:
          f:version:
        f:vgpuDeviceManager:
          .:
          f:config:
            .:
            f:default:
            f:name:
          f:enabled:
          f:image:
          f:imagePullPolicy:
          f:repository:
          f:version:
        f:vgpuManager:
          .:
          f:driverManager:
            .:
            f:env:
            f:image:
            f:imagePullPolicy:
            f:repository:
            f:version:
          f:enabled:
          f:image:
          f:imagePullPolicy:
    Manager:         helm
    Operation:       Update
    Time:            2022-11-14T14:13:48Z
  Resource Version:  1704158
  UID:               e678ca53-b692-4f5f-90b7-0209c266fd74
Spec:
  Daemonsets:
    Priority Class Name:  system-node-critical
    Tolerations:
      Effect:    NoSchedule
      Key:       nvidia.com/gpu
      Operator:  Exists
  Dcgm:
    Enabled:            false
    Host Port:          5555
    Image:              dcgm
    Image Pull Policy:  IfNotPresent
    Repository:         ********/nvcr.io/nvidia/cloud-native
    Version:            3.0.4-1-ubuntu20.04
  Dcgm Exporter:
    Enabled:  true
    Env:
      Name:             DCGM_EXPORTER_LISTEN
      Value:            :9400
      Name:             DCGM_EXPORTER_KUBERNETES
      Value:            true
      Name:             DCGM_EXPORTER_COLLECTORS
      Value:            /etc/dcgm-exporter/dcp-metrics-included.csv
    Image:              dcgm-exporter
    Image Pull Policy:  IfNotPresent
    Repository:         ********/nvcr.io/nvidia/k8s
    Service Monitor:
      Additional Labels:
      Enabled:       false
      Honor Labels:  false
      Interval:      15s
    Version:         3.0.4-3.0.0-ubuntu20.04
  Device Plugin:
    Config:
      Default:
      Name:
    Enabled:    true
    Env:
      Name:             PASS_DEVICE_SPECS
      Value:            true
      Name:             FAIL_ON_INIT_ERROR
      Value:            true
      Name:             DEVICE_LIST_STRATEGY
      Value:            envvar
      Name:             DEVICE_ID_STRATEGY
      Value:            uuid
      Name:             NVIDIA_VISIBLE_DEVICES
      Value:            all
      Name:             NVIDIA_DRIVER_CAPABILITIES
      Value:            all
    Image:              k8s-device-plugin
    Image Pull Policy:  IfNotPresent
    Repository:         ********/library/nvcr.io/nvidia
    Version:            v0.12.3-ubi8
  Driver:
    Cert Config:
      Name:
    Enabled:            false
    Image:              driver
    Image Pull Policy:  IfNotPresent
    Kernel Module Config:
      Name:
    Licensing Config:
      Config Map Name:
      Nls Enabled:      false
    Manager:
      Env:
        Name:             ENABLE_AUTO_DRAIN
        Value:            true
        Name:             DRAIN_USE_FORCE
        Value:            false
        Name:             DRAIN_POD_SELECTOR_LABEL
        Value:
        Name:             DRAIN_TIMEOUT_SECONDS
        Value:            0s
        Name:             DRAIN_DELETE_EMPTYDIR_DATA
        Value:            false
      Image:              k8s-driver-manager
      Image Pull Policy:  IfNotPresent
      Repository:         ********/nvcr.io/nvidia/cloud-native
      Version:            v0.4.2
    Rdma:
      Enabled:         false
      Use Host Mofed:  false
    Repo Config:
      Config Map Name:
    Repository:         ********/nvcr.io/nvidia
    Rolling Update:
      Max Unavailable:  1
    Version:            515.65.01
    Virtual Topology:
      Config:
  Gfd:
    Enabled:  true
    Env:
      Name:             GFD_SLEEP_INTERVAL
      Value:            60s
      Name:             GFD_FAIL_ON_INIT_ERROR
      Value:            true
    Image:              gpu-feature-discovery
    Image Pull Policy:  IfNotPresent
    Repository:         ********/nvcr.io/nvidia
    Version:            v0.6.2-ubi8
  Mig:
    Strategy:  single
  Mig Manager:
    Config:
      Name:
    Enabled:  true
    Env:
      Name:   WITH_REBOOT
      Value:  false
    Gpu Clients Config:
      Name:
    Image:              k8s-mig-manager
    Image Pull Policy:  IfNotPresent
    Repository:         ********/nvcr.io/nvidia/cloud-native
    Version:            v0.5.0-ubuntu20.04
  Node Status Exporter:
    Enabled:            false
    Image:              gpu-operator-validator
    Image Pull Policy:  IfNotPresent
    Repository:         ********/nvcr.io/nvidia/cloud-native
    Version:            v22.9.0
  Operator:
    Default Runtime:  docker
    Init Container:
      Image:              cuda
      Image Pull Policy:  IfNotPresent
      Repository:         ********/nvcr.io/nvidia
      Version:            11.7.1-base-ubi8
    Runtime Class:        nvidia
  Psp:
    Enabled:  false
  Sandbox Device Plugin:
    Enabled:            true
    Image:              kubevirt-gpu-device-plugin
    Image Pull Policy:  IfNotPresent
    Repository:         ********/nvcr.io/nvidia
    Version:            v1.2.1
  Sandbox Workloads:
    Default Workload:  container
    Enabled:           false
  Toolkit:
    Enabled:            false
    Image:              container-toolkit
    Image Pull Policy:  IfNotPresent
    Install Dir:        /usr/local/nvidia
    Repository:         ********/nvcr.io/nvidia/k8s
    Version:            v1.11.0-ubuntu20.04
  Validator:
    Image:              gpu-operator-validator
    Image Pull Policy:  IfNotPresent
    Plugin:
      Env:
        Name:    WITH_WORKLOAD
        Value:   true
    Repository:  ********/nvcr.io/nvidia/cloud-native
    Version:     v22.9.0
  Vfio Manager:
    Driver Manager:
      Env:
        Name:             ENABLE_AUTO_DRAIN
        Value:            false
      Image:              k8s-driver-manager
      Image Pull Policy:  IfNotPresent
      Repository:         ********/nvcr.io/nvidia/cloud-native
      Version:            v0.4.2
    Enabled:              true
    Image:                cuda
    Image Pull Policy:    IfNotPresent
    Repository:           ********/nvcr.io/nvidia
    Version:              11.7.1-base-ubi8
  Vgpu Device Manager:
    Config:
      Default:          default
      Name:
    Enabled:            true
    Image:              vgpu-device-manager
    Image Pull Policy:  IfNotPresent
    Repository:         ********/nvcr.io/nvidia/cloud-native
    Version:            v0.2.0
  Vgpu Manager:
    Driver Manager:
      Env:
        Name:             ENABLE_AUTO_DRAIN
        Value:            false
      Image:              k8s-driver-manager
      Image Pull Policy:  IfNotPresent
      Repository:         ********/nvcr.io/nvidia/cloud-native
      Version:            v0.4.2
    Enabled:              false
    Image:                vgpu-manager
    Image Pull Policy:    IfNotPresent
Events:                   <none>
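
Regarding the i2c_core / ipmi_msghandler checklist item above: a minimal way to check for (and, if missing, load) those kernel modules on a node is sketched below. The module names come from the checklist; the commands are plain lsmod/modprobe usage, not operator-specific tooling.

lsmod | grep -E 'i2c_core|ipmi_msghandler'   # check whether the modules are loaded
sudo modprobe i2c_core                       # load them if missing (no-op if built into the kernel)
sudo modprobe ipmi_msghandler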

1. Issue or feature description

The nvidia-device-plugin-daemonset failed with the following error:

2022/11/14 14:20:20 Starting FS watcher.
2022/11/14 14:20:20 Error: failed to create FS watcher: too many open files

2. Steps to reproduce the issue

It's a clean installation of a vanilla Kubernetes 1.25.3 on a Dell PowerEdge R740 server with two NVIDIA A30 cards. The whole system is air-gapped, and because of sensitive information I can't share much more about what we did.

Regarding the GPU operator itself, I ran the following commands:

On the server itself:

sudo apt-get update
sudo apt-get install --yes nvidia-driver-515-server
sudo apt-get install --yes nvidia-container-runtime
sudo apt-get install --yes nvidia-container-toolkit
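
Since the operator is installed later with driver.enabled=false and toolkit.enabled=false, the host packages above are what the whole stack relies on. A quick sanity check at this point (standard usage, not part of the original procedure) would be:

nvidia-smi   # should list both A30 cards and the 515 driver version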

sudo nano /etc/containerd/config.toml
#   ...
#       [plugins."io.containerd.grpc.v1.cri".containerd]
#         default_runtime_name = "nvidia"
#   ...
#             [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
#               ...
#               SystemdCgroup = true
#   ...
#           [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#             privileged_without_host_devices = false
#             runtime_engine = ""
#             runtime_root = ""
#             runtime_type = "io.containerd.runc.v2"
#
#             [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
#               BinaryName = "/usr/bin/nvidia-container-runtime"
#               SystemdCgroup = true
#   ...

sudo systemctl restart containerd
sudo systemctl status  containerd
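
Optionally, to confirm containerd actually picked up the nvidia runtime after the restart, dumping the effective configuration is a simple check (generic containerd usage, not taken from the operator docs):

sudo containerd config dump | grep -A 5 'runtimes.nvidia'
# the runtimes.nvidia sections should show runtime_type = "io.containerd.runc.v2"
# and BinaryName = "/usr/bin/nvidia-container-runtime"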

On another server:

cd \path\to\1-OS-Configuration\Compute\6-GPU-Accelerators\GPU-Operator

helm --kubeconfig $KUBECONFIG install `
    gpu-operator `
    .\v22.9.0-customized `
    `
    --namespace gpu-operator `
    --create-namespace `
    --wait `
    --set driver.enabled=false `
    --set toolkit.enabled=false
#   NAME: gpu-operator
#   LAST DEPLOYED: Thu Nov 10 18:41:53 2022
#   NAMESPACE: gpu-operator
#   STATUS: deployed
#   REVISION: 1
#   TEST SUITE: None

The .\v22.9.0-customized directory is simply the chart's deployments/gpu-operator directory with all repositories pointed at a private registry that hosts the required images.

Note that the same procedure with GPU operator v1.11.0 and Kubernetes 1.23.9 ran without any issue.

3. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods --all-namespaces
NAMESPACE              NAME                                                              READY   STATUS                  RESTARTS        AGE 
calico-apiserver       calico-apiserver-5b48fcdf8d-5h7d6                                 1/1     Running                 10 (22m ago)    5d6h
calico-apiserver       calico-apiserver-5b48fcdf8d-5m7hd                                 1/1     Running                 9 (22m ago)     5d6h
calico-system          calico-kube-controllers-5b456948d6-7wxdf                          1/1     Running                 9 (22m ago)     5d6h
calico-system          calico-node-lq2nq                                                 1/1     Running                 8 (22m ago)     5d1h
calico-system          calico-typha-6c887b44f7-j8jjz                                     1/1     Running                 12 (22m ago)    5d6h
ceph-csi               csi-rbdplugin-provisioner-79745b68d5-9lrh9                        7/7     Running                 49 (22m ago)    4d  
ceph-csi               csi-rbdplugin-x7pgr                                               3/3     Running                 18 (22m ago)    4d  
gpu-operator           gpu-feature-discovery-fvc98                                       1/1     Running                 0               20m 
gpu-operator           gpu-operator-5cfd867fdd-9gczh                                     1/1     Running                 0               21m
gpu-operator           gpu-operator-node-feature-discovery-master-54788ff856-5jz25       1/1     Running                 0               21m
gpu-operator           gpu-operator-node-feature-discovery-worker-xq7dc                  1/1     Running                 0               21m
gpu-operator           nvidia-cuda-validator-ck4f7                                       0/1     Completed               0               20m
gpu-operator           nvidia-dcgm-exporter-kjqwh                                        1/1     Running                 0               20m
gpu-operator           nvidia-device-plugin-daemonset-vwxgt                              0/1     CrashLoopBackOff        8 (4m28s ago)   20m
gpu-operator           nvidia-mig-manager-5dz9k                                          1/1     Running                 0               20m
gpu-operator           nvidia-operator-validator-t4h4f                                   0/1     Init:CrashLoopBackOff   5 (2m11s ago)   20m
kube-system            coredns-644b469dc9-vplt5                                          1/1     Running                 10 (22m ago)    5d7h
kube-system            coredns-644b469dc9-xrmh6                                          1/1     Running                 9 (22m ago)     5d7h
kube-system            etcd-*****************************                                1/1     Running                 10 (22m ago)    5d7h
kube-system            haproxy-*****************************                             1/1     Running                 10 (22m ago)    5d7h
kube-system            keepalived-*****************************                          1/1     Running                 10 (22m ago)    5d7h
kube-system            kube-apiserver-*****************************                      1/1     Running                 6 (22m ago)     4d4h
kube-system            kube-controller-manager-*****************************             1/1     Running                 10 (22m ago)    5d7h
kube-system            kube-proxy-8fp57                                                  1/1     Running                 10 (22m ago)    5d7h
kube-system            kube-scheduler-*****************************                      1/1     Running                 10 (22m ago)    5d7h
kubernetes-dashboard   dashboard-metrics-scraper-55db86c456-jdp9t                        1/1     Running                 7 (22m ago)     4d4h
kubernetes-dashboard   kubernetes-dashboard-7fbd9df566-jx26n                             1/1     Running                 7 (22m ago)     4d4h
metallb-system         controller-6f5956cc85-bbpgx                                       1/1     Running                 7 (22m ago)     4d8h
metallb-system         speaker-975bw                                                     1/1     Running                 12 (22m ago)    4d8h
rook-ceph              csi-cephfsplugin-k8z6t                                            2/2     Running                 14 (22m ago)    4d23h
rook-ceph              csi-cephfsplugin-provisioner-765c68d589-2jf82                     5/5     Running                 45 (22m ago)    4d23h
rook-ceph              csi-rbdplugin-provisioner-77ddd55848-vw86z                        5/5     Running                 35 (22m ago)    4d23h
rook-ceph              csi-rbdplugin-z7m7b                                               2/2     Running                 14 (22m ago)    4d23h
rook-ceph              rook-ceph-crashcollector-*****************************-746x8mq7   1/1     Running                 8 (22m ago)     4d23h
rook-ceph              rook-ceph-mds-*************************-644fb89f98-ktncf          2/2     Running                 18 (22m ago)    4d23h
rook-ceph              rook-ceph-mds-*************************-59cbdfc785-5b4th          0/2     Pending                 0               4d23h
rook-ceph              rook-ceph-mgr-a-696dfdbc85-mgkpb                                  2/2     Running                 14 (22m ago)    4d23h
rook-ceph              rook-ceph-mon-a-58d567dd4-mmrr7                                   2/2     Running                 18 (22m ago)    4d23h
rook-ceph              rook-ceph-operator-6fc456bfb5-bnvkq                               1/1     Running                 8 (22m ago)     5d
rook-ceph              rook-ceph-osd-0-6b69bbbb47-ppj9z                                  2/2     Running                 16 (22m ago)    4d23h
rook-ceph              rook-ceph-osd-prepare-*****************************-4jw49         0/1     Completed               0               20m
rook-ceph              rook-ceph-tools-7bc98fdd5f-mv4p6                                  1/1     Running                 7 (22m ago)     4d23h
tigera-operator        tigera-operator-858497bcb6-pjkcl                                  1/1     Running                 12 (22m ago)    5d6h
  • kubernetes daemonset status: kubectl get ds --all-namespaces
NAMESPACE        NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE  
calico-system    calico-node                                  1         1         1       1            1           kubernetes.io/os=linux                             5d6h 
calico-system    csi-node-driver                              0         0         0       0            0           kubernetes.io/os=linux                             5d6h 
ceph-csi         csi-rbdplugin                                1         1         1       1            1           <none>                                             4d   
gpu-operator     gpu-feature-discovery                        1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true   21m  
gpu-operator     gpu-operator-node-feature-discovery-worker   1         1         1       1            1           <none>                                             21m  
gpu-operator     nvidia-dcgm-exporter                         1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true           21m  
gpu-operator     nvidia-device-plugin-daemonset               1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true           21m  
gpu-operator     nvidia-mig-manager                           1         1         1       1            1           nvidia.com/gpu.deploy.mig-manager=true             21m  
gpu-operator     nvidia-operator-validator                    1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true      21m  
kube-system      kube-proxy                                   1         1         1       1            1           kubernetes.io/os=linux                             5d7h 
metallb-system   speaker                                      1         1         1       1            1           kubernetes.io/os=linux                             4d8h 
rook-ceph        csi-cephfsplugin                             1         1         1       1            1           <none>                                             4d23h
rook-ceph        csi-rbdplugin                                1         1         1       1            1           <none>                                             4d23h
  • If a pod/ds is in an error or pending state: kubectl describe pod -n NAMESPACE POD_NAME
$ kubectl --kubeconfig $KUBECONFIG describe pod -n gpu-operator nvidia-device-plugin-daemonset
Name:                 nvidia-device-plugin-daemonset-vwxgt
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 *****************************/10.0.0.1
Start Time:           Mon, 14 Nov 2022 15:14:09 +0100
Labels:               app=nvidia-device-plugin-daemonset
                      controller-revision-hash=5c6f9f9597
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: 5c7d60815e6e752de8c738098630613efbd25781971c93ecbc77432857c64599
                      cni.projectcalico.org/podIP: 192.168.214.45/32
                      cni.projectcalico.org/podIPs: 192.168.214.45/32
Status:               Running
IP:                   192.168.214.45
IPs:
  IP:           192.168.214.45
Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
Init Containers:
  toolkit-validation:
    Container ID:  containerd://2f670c9ca8c0e9631e6bb144ea246f8caa61546f249c4d91dfc2d11a17518006
    Image:         **********************************/library/nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0
    Image ID:      **********************************/library/nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:90fd8bb01d8089f900d35a699e0137599ac9de9f37e374eeb702fc90314af5bf
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 14 Nov 2022 15:14:15 +0100
      Finished:     Mon, 14 Nov 2022 15:14:41 +0100
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /run/nvidia from run-nvidia (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7s76p (ro)
Containers:
  nvidia-device-plugin:
    Container ID:  containerd://74b772357c5fab89ff79c751c47ed19d143e40ccdfaf12eab090ed145fa9342b
    Image:         **********************************/library/nvcr.io/nvidia/k8s-device-plugin:v0.12.3-ubi8
    Image ID:      **********************************/library/nvcr.io/nvidia/k8s-device-plugin@sha256:a9c2cba87729fe625f647d8000b354aecf209b6e139f171d49ed06ff09f3c24a
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -c
    Args:
      [[ -f /run/nvidia/validations/host-driver-ready ]] && driver_root=/ || driver_root=/run/nvidia/driver; export NVIDIA_DRIVER_ROOT=$driver_root; exec nvidia-device-plugin;
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 14 Nov 2022 15:35:46 +0100
      Finished:     Mon, 14 Nov 2022 15:35:46 +0100
    Ready:          False
    Restart Count:  9
    Environment:
      PASS_DEVICE_SPECS:           true
      FAIL_ON_INIT_ERROR:          true
      DEVICE_LIST_STRATEGY:        envvar
      DEVICE_ID_STRATEGY:          uuid
      NVIDIA_VISIBLE_DEVICES:      all
      NVIDIA_DRIVER_CAPABILITIES:  all
      MIG_STRATEGY:                single
      NVIDIA_MIG_MONITOR_DEVICES:  all
    Mounts:
      /run/nvidia from run-nvidia (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7s76p (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  Directory
  kube-api-access-7s76p:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.device-plugin=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason       Age                   From               Message
  ----     ------       ----                  ----               -------
  Normal   Scheduled    23m                   default-scheduler  Successfully assigned gpu-operator/nvidia-device-plugin-daemonset-vwxgt to *****************************
  Warning  FailedMount  23m                   kubelet            MountVolume.SetUp failed for volume "run-nvidia" : hostPath type check failed: /run/nvidia is not a directory
  Normal   Pulled       23m                   kubelet            Container image "**********************************/library/nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on 
machine
  Normal   Created      23m                   kubelet            Created container toolkit-validation
  Normal   Started      23m                   kubelet            Started container toolkit-validation
  Normal   Pulled       21m (x5 over 23m)     kubelet            Container image "**********************************/library/nvcr.io/nvidia/k8s-device-plugin:v0.12.3-ubi8" already present on machine      
  Normal   Created      21m (x5 over 23m)     kubelet            Created container nvidia-device-plugin
  Normal   Started      21m (x5 over 23m)     kubelet            Started container nvidia-device-plugin
  Warning  BackOff      3m12s (x92 over 23m)  kubelet            Back-off restarting failed container

  • If a pod/ds is in an error or pending state: kubectl logs -n NAMESPACE POD_NAME
$ kubectl --kubeconfig $KUBECONFIG logs -n gpu-operator nvidia-device-plugin-daemonset-vwxgt        
2022/11/14 14:40:53 Starting FS watcher.
2022/11/14 14:40:53 Error: failed to create FS watcher: too many open files
  • Output of running a container on the GPU machine (docker run -it alpine echo foo): Docker is not installed, containerd is used

  • Docker configuration file (cat /etc/docker/daemon.json): Docker is not installed

  • Docker runtime configuration (docker info | grep runtime): Docker is not installed

  • NVIDIA shared directory: ls -la /run/nvidia

$ ls -la /run/nvidia
total 0
drwxr-xr-x  4 root root   80 Nov 14 15:14 .
drwxr-xr-x 35 root root 1000 Nov 14 15:14 ..
drwxr-xr-x  2 root root   40 Nov 14 15:14 driver
drwxr-xr-x  2 root root  100 Nov 14 15:14 validations

  • NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit

$ ls -la /usr/local/nvidia/toolkit/
total 12920
drwxr-xr-x 3 root root 4096 Nov 14 14:49 .
drwxr-xr-x 3 root root 4096 Nov 14 14:49 ..
drwxr-xr-x 3 root root 4096 Nov 14 14:49 .config
lrwxrwxrwx 1 root root 32 Nov 14 14:49 libnvidia-container-go.so.1 -> libnvidia-container-go.so.1.11.0
-rw-r--r-- 1 root root 2959384 Nov 14 14:49 libnvidia-container-go.so.1.11.0
lrwxrwxrwx 1 root root 29 Nov 14 14:49 libnvidia-container.so.1 -> libnvidia-container.so.1.11.0
-rwxr-xr-x 1 root root 195856 Nov 14 14:49 libnvidia-container.so.1.11.0
-rwxr-xr-x 1 root root 154 Nov 14 14:49 nvidia-container-cli
-rwxr-xr-x 1 root root 47472 Nov 14 14:49 nvidia-container-cli.real
-rwxr-xr-x 1 root root 342 Nov 14 14:49 nvidia-container-runtime
-rwxr-xr-x 1 root root 350 Nov 14 14:49 nvidia-container-runtime-experimental
-rwxr-xr-x 1 root root 3771792 Nov 14 14:49 nvidia-container-runtime.experimental
-rwxr-xr-x 1 root root 203 Nov 14 14:49 nvidia-container-runtime-hook
-rwxr-xr-x 1 root root 2142088 Nov 14 14:49 nvidia-container-runtime-hook.real
-rwxr-xr-x 1 root root 4079040 Nov 14 14:49 nvidia-container-runtime.real
lrwxrwxrwx 1 root root 29 Nov 14 14:49 nvidia-container-toolkit -> nvidia-container-runtime-hook

  • NVIDIA driver directory: ls -la /run/nvidia/driver

$ ls -la /run/nvidia/driver/
total 0
drwxr-xr-x 2 root root 40 Nov 14 15:14 .
drwxr-xr-x 4 root root 80 Nov 14 15:14 ..

  • kubelet logs (journalctl -u kubelet > kubelet.logs): not attached

@EajksEajks Can you check whether the value of max_user_watches is set too low?

sysctl -a | grep fs.inotify.max_user_

@shivamerla

For sure. Here it is:

fs.inotify.max_user_instances = 128
fs.inotify.max_user_watches = 8192

What would be considered too low?

Okay. I have no idea why I had to do the following, given that I installed only a minimal set of services on a clean vanilla K8s installation, itself running on a clean vanilla Ubuntu 20.04 installation, and that this is mentioned nowhere in the Kubernetes or GPU operator installation instructions, but it solved my problem (I also have no idea what good or bad values would be).

So I modified my original sysctl config from

cat << 'EOF' | sudo tee /etc/sysctl.d/99-kubernetes-cri.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF

to

cat << 'EOF' | sudo tee /etc/sysctl.d/99-kubernetes-cri.conf
fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 524288
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF

Now it works.
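
For reference, the new values can be applied without a reboot (standard sysctl usage, assuming the file above):

sudo sysctl --system                                                 # reload all sysctl.d files
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches     # verify the new limits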

You can run the following, which lists every process that has inotify instances open, together with how many each one holds:

sudo find /proc/*/fd -lname anon_inode:inotify | cut -d/ -f3 | xargs -I '{}' -- ps --no-headers -o '%p %U %c' -p '{}' | uniq -c | sort -nr

On Ubuntu 22.04 the following is the default, so your earlier values seem too low.

fs.inotify.max_user_watches = 122425

@EajksEajks @shivamerla
Would you please help me find which file we need to change the fs.inotify.max_* settings in? I don't have a 99-kubernetes-cri.conf file in my environment.

I tried searching all the files under /etc/sysctl.d but could not find that string.

# Ubuntu VERSION="22.04.2"
# I have Docker running:
root@:/etc/sysctl.d# docker info | grep runtime
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc

root@:/etc/sysctl.d# kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.9"

Quick fix: run these commands on the node:

sysctl -w fs.inotify.max_user_watches=100000 
sysctl -w fs.inotify.max_user_instances=100000
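
To make that quick fix survive reboots, the same approach as in the earlier comment works on any node. The file name below is arbitrary (any *.conf under /etc/sysctl.d is read), and the values are the ones reported to work above:

cat << 'EOF' | sudo tee /etc/sysctl.d/99-inotify.conf
fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 524288
EOF
sudo sysctl --system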