Error: failed to create FS watcher: too many open files
EajksEajks opened this issue · 6 comments
Hi.
I have an issue deploying the GPU operator v22.9.0 on a vanilla Kubernetes 1.25.3 cluster running on a Dell PowerEdge R740 server with two Nvidia A30 cards. The nvidia-device-plugin-daemonset failed with the following error:
2022/11/14 14:20:20 Starting FS watcher.
2022/11/14 14:20:20 Error: failed to create FS watcher: too many open files
1. Quick Debug Checklist
- Are you running on an Ubuntu 18.04 node? Ubuntu 20.04
- Are you running Kubernetes v1.13+? Kubernetes 1.25.3
- Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? containerd.io 1.6.9-1
- Do you have i2c_core and ipmi_msghandler loaded on the nodes? No idea what you're talking about...
- Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
Name: cluster-policy
Namespace:
Labels: app.kubernetes.io/component=gpu-operator
app.kubernetes.io/managed-by=Helm
Annotations: meta.helm.sh/release-name: gpu-operator
meta.helm.sh/release-namespace: gpu-operator
API Version: nvidia.com/v1
Kind: ClusterPolicy
Metadata:
Creation Timestamp: 2022-11-14T14:13:48Z
Generation: 1
Managed Fields:
API Version: nvidia.com/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:meta.helm.sh/release-name:
f:meta.helm.sh/release-namespace:
f:labels:
.:
f:app.kubernetes.io/component:
f:app.kubernetes.io/managed-by:
f:spec:
.:
f:daemonsets:
.:
f:priorityClassName:
f:tolerations:
f:dcgm:
.:
f:enabled:
f:hostPort:
f:image:
f:imagePullPolicy:
f:repository:
f:version:
f:dcgmExporter:
.:
f:enabled:
f:env:
f:image:
f:imagePullPolicy:
f:repository:
f:serviceMonitor:
.:
f:additionalLabels:
f:enabled:
f:honorLabels:
f:interval:
f:version:
f:devicePlugin:
.:
f:config:
.:
f:default:
f:name:
f:enabled:
f:env:
f:image:
f:imagePullPolicy:
f:repository:
f:version:
f:driver:
.:
f:certConfig:
.:
f:name:
f:enabled:
f:image:
f:imagePullPolicy:
f:kernelModuleConfig:
.:
f:name:
f:licensingConfig:
.:
f:configMapName:
f:nlsEnabled:
f:manager:
.:
f:env:
f:image:
f:imagePullPolicy:
f:repository:
f:version:
f:rdma:
.:
f:enabled:
f:useHostMofed:
f:repoConfig:
.:
f:configMapName:
f:repository:
f:rollingUpdate:
.:
f:maxUnavailable:
f:version:
f:virtualTopology:
.:
f:config:
f:gfd:
.:
f:enabled:
f:env:
f:image:
f:imagePullPolicy:
f:repository:
f:version:
f:mig:
.:
f:strategy:
f:migManager:
.:
f:config:
.:
f:name:
f:enabled:
f:env:
f:gpuClientsConfig:
.:
f:name:
f:image:
f:imagePullPolicy:
f:repository:
f:version:
f:nodeStatusExporter:
.:
f:enabled:
f:image:
f:imagePullPolicy:
f:repository:
f:version:
f:operator:
.:
f:defaultRuntime:
f:initContainer:
.:
f:image:
f:imagePullPolicy:
f:repository:
f:version:
f:runtimeClass:
f:psp:
.:
f:enabled:
f:sandboxDevicePlugin:
.:
f:enabled:
f:image:
f:imagePullPolicy:
f:repository:
f:version:
f:sandboxWorkloads:
.:
f:defaultWorkload:
f:enabled:
f:toolkit:
.:
f:enabled:
f:image:
f:imagePullPolicy:
f:installDir:
f:repository:
f:version:
f:validator:
.:
f:image:
f:imagePullPolicy:
f:plugin:
.:
f:env:
f:repository:
f:version:
f:vfioManager:
.:
f:driverManager:
.:
f:env:
f:image:
f:imagePullPolicy:
f:repository:
f:version:
f:enabled:
f:image:
f:imagePullPolicy:
f:repository:
f:version:
f:vgpuDeviceManager:
.:
f:config:
.:
f:default:
f:name:
f:enabled:
f:image:
f:imagePullPolicy:
f:repository:
f:version:
f:vgpuManager:
.:
f:driverManager:
.:
f:env:
f:image:
f:imagePullPolicy:
f:repository:
f:version:
f:enabled:
f:image:
f:imagePullPolicy:
Manager: helm
Operation: Update
Time: 2022-11-14T14:13:48Z
Resource Version: 1704158
UID: e678ca53-b692-4f5f-90b7-0209c266fd74
Spec:
Daemonsets:
Priority Class Name: system-node-critical
Tolerations:
Effect: NoSchedule
Key: nvidia.com/gpu
Operator: Exists
Dcgm:
Enabled: false
Host Port: 5555
Image: dcgm
Image Pull Policy: IfNotPresent
Repository: ********/nvcr.io/nvidia/cloud-native
Version: 3.0.4-1-ubuntu20.04
Dcgm Exporter:
Enabled: true
Env:
Name: DCGM_EXPORTER_LISTEN
Value: :9400
Name: DCGM_EXPORTER_KUBERNETES
Value: true
Name: DCGM_EXPORTER_COLLECTORS
Value: /etc/dcgm-exporter/dcp-metrics-included.csv
Image: dcgm-exporter
Image Pull Policy: IfNotPresent
Repository: ********/nvcr.io/nvidia/k8s
Service Monitor:
Additional Labels:
Enabled: false
Honor Labels: false
Interval: 15s
Version: 3.0.4-3.0.0-ubuntu20.04
Device Plugin:
Config:
Default:
Name:
Enabled: true
Env:
Name: PASS_DEVICE_SPECS
Value: true
Name: FAIL_ON_INIT_ERROR
Value: true
Name: DEVICE_LIST_STRATEGY
Value: envvar
Name: DEVICE_ID_STRATEGY
Value: uuid
Name: NVIDIA_VISIBLE_DEVICES
Value: all
Name: NVIDIA_DRIVER_CAPABILITIES
Value: all
Image: k8s-device-plugin
Image Pull Policy: IfNotPresent
Repository: ********/library/nvcr.io/nvidia
Version: v0.12.3-ubi8
Driver:
Cert Config:
Name:
Enabled: false
Image: driver
Image Pull Policy: IfNotPresent
Kernel Module Config:
Name:
Licensing Config:
Config Map Name:
Nls Enabled: false
Manager:
Env:
Name: ENABLE_AUTO_DRAIN
Value: true
Name: DRAIN_USE_FORCE
Value: false
Name: DRAIN_POD_SELECTOR_LABEL
Value:
Name: DRAIN_TIMEOUT_SECONDS
Value: 0s
Name: DRAIN_DELETE_EMPTYDIR_DATA
Value: false
Image: k8s-driver-manager
Image Pull Policy: IfNotPresent
Repository: ********/nvcr.io/nvidia/cloud-native
Version: v0.4.2
Rdma:
Enabled: false
Use Host Mofed: false
Repo Config:
Config Map Name:
Repository: ********/nvcr.io/nvidia
Rolling Update:
Max Unavailable: 1
Version: 515.65.01
Virtual Topology:
Config:
Gfd:
Enabled: true
Env:
Name: GFD_SLEEP_INTERVAL
Value: 60s
Name: GFD_FAIL_ON_INIT_ERROR
Value: true
Image: gpu-feature-discovery
Image Pull Policy: IfNotPresent
Repository: ********/nvcr.io/nvidia
Version: v0.6.2-ubi8
Mig:
Strategy: single
Mig Manager:
Config:
Name:
Enabled: true
Env:
Name: WITH_REBOOT
Value: false
Gpu Clients Config:
Name:
Image: k8s-mig-manager
Image Pull Policy: IfNotPresent
Repository: ********/nvcr.io/nvidia/cloud-native
Version: v0.5.0-ubuntu20.04
Node Status Exporter:
Enabled: false
Image: gpu-operator-validator
Image Pull Policy: IfNotPresent
Repository: ********/nvcr.io/nvidia/cloud-native
Version: v22.9.0
Operator:
Default Runtime: docker
Init Container:
Image: cuda
Image Pull Policy: IfNotPresent
Repository: ********/nvcr.io/nvidia
Version: 11.7.1-base-ubi8
Runtime Class: nvidia
Psp:
Enabled: false
Sandbox Device Plugin:
Enabled: true
Image: kubevirt-gpu-device-plugin
Image Pull Policy: IfNotPresent
Repository: ********/nvcr.io/nvidia
Version: v1.2.1
Sandbox Workloads:
Default Workload: container
Enabled: false
Toolkit:
Enabled: false
Image: container-toolkit
Image Pull Policy: IfNotPresent
Install Dir: /usr/local/nvidia
Repository: ********/nvcr.io/nvidia/k8s
Version: v1.11.0-ubuntu20.04
Validator:
Image: gpu-operator-validator
Image Pull Policy: IfNotPresent
Plugin:
Env:
Name: WITH_WORKLOAD
Value: true
Repository: ********/nvcr.io/nvidia/cloud-native
Version: v22.9.0
Vfio Manager:
Driver Manager:
Env:
Name: ENABLE_AUTO_DRAIN
Value: false
Image: k8s-driver-manager
Image Pull Policy: IfNotPresent
Repository: ********/nvcr.io/nvidia/cloud-native
Version: v0.4.2
Enabled: true
Image: cuda
Image Pull Policy: IfNotPresent
Repository: ********/nvcr.io/nvidia
Version: 11.7.1-base-ubi8
Vgpu Device Manager:
Config:
Default: default
Name:
Enabled: true
Image: vgpu-device-manager
Image Pull Policy: IfNotPresent
Repository: ********/nvcr.io/nvidia/cloud-native
Version: v0.2.0
Vgpu Manager:
Driver Manager:
Env:
Name: ENABLE_AUTO_DRAIN
Value: false
Image: k8s-driver-manager
Image Pull Policy: IfNotPresent
Repository: ********/nvcr.io/nvidia/cloud-native
Version: v0.4.2
Enabled: false
Image: vgpu-manager
Image Pull Policy: IfNotPresent
Events: <none>
1. Issue or feature description
The nvidia-device-plugin-daemonset failed with the following error:
2022/11/14 14:20:20 Starting FS watcher.
2022/11/14 14:20:20 Error: failed to create FS watcher: too many open files
2. Steps to reproduce the issue
It's a clean installation of vanilla Kubernetes 1.25.3 on a Dell PowerEdge R740 server with two Nvidia A30 cards. The whole system is air-gapped, and because of sensitive information I can't say much more about what we did.
Regarding the GPU operator itself, I ran the following instructions:
On the server itself:
sudo apt-get update
sudo apt-get install --yes nvidia-driver-515-server
sudo apt-get install --yes nvidia-container-runtime
sudo apt-get install --yes nvidia-container-toolkit
sudo nano /etc/containerd/config.toml
# ...
# [plugins."io.containerd.grpc.v1.cri".containerd]
# default_runtime_name = "nvidia"
# ...
# [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
# ...
# SystemdCgroup = true
# ...
# [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
# privileged_without_host_devices = false
# runtime_engine = ""
# runtime_root = ""
# runtime_type = "io.containerd.runc.v2"
#
# [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
# BinaryName = "/usr/bin/nvidia-container-runtime"
# SystemdCgroup = true
# ...
sudo systemctl restart containerd
sudo systemctl status containerd
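As a quick sanity check after the restart (not part of the original instructions, and assuming containerd and crictl are available on the node), something along these lines should show the nvidia runtime registered:
# Dump the merged containerd configuration and look for the nvidia runtime section
sudo containerd config dump | grep -A 5 'runtimes.nvidia'
# Ask the CRI endpoint for its view of the configured runtimes
sudo crictl info | grep -i nvidia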
On another server:
cd \path\to\1-OS-Configuration\Compute\6-GPU-Accelerators\GPU-Operator
helm --kubeconfig $KUBECONFIG install `
gpu-operator `
.\v22.9.0-customized `
`
--namespace gpu-operator `
--create-namespace `
--wait `
--set driver.enabled=false `
--set toolkit.enabled=false
# NAME: gpu-operator
# LAST DEPLOYED: Thu Nov 10 18:41:53 2022
# NAMESPACE: gpu-operator
# STATUS: deployed
# REVISION: 1
# TEST SUITE: None
The .\v22.9.0-customized directory is just the deployments/gpu-operator chart directory in which all repository values have been replaced with our private registry, where all required images are stored.
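For illustration only (I have not verified the exact value paths for this chart version, and <private-registry> is a placeholder), the same redirection could presumably be done with --set overrides instead of editing the chart, along the lines of:
helm --kubeconfig $KUBECONFIG install `
  gpu-operator `
  .\v22.9.0-customized `
  --namespace gpu-operator `
  --create-namespace `
  --wait `
  --set driver.enabled=false `
  --set toolkit.enabled=false `
  --set validator.repository=<private-registry>/nvcr.io/nvidia/cloud-native `
  --set devicePlugin.repository=<private-registry>/library/nvcr.io/nvidia `
  --set dcgmExporter.repository=<private-registry>/nvcr.io/nvidia/k8s `
  --set gfd.repository=<private-registry>/nvcr.io/nvidia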
Note that the same procedure with gpu-operator v1.11.0 and kubernetes 1.23.9 ran without any issue.
3. Information to attach (optional if deemed irrelevant)
- kubernetes pods status:
kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
calico-apiserver calico-apiserver-5b48fcdf8d-5h7d6 1/1 Running 10 (22m ago) 5d6h
calico-apiserver calico-apiserver-5b48fcdf8d-5m7hd 1/1 Running 9 (22m ago) 5d6h
calico-system calico-kube-controllers-5b456948d6-7wxdf 1/1 Running 9 (22m ago) 5d6h
calico-system calico-node-lq2nq 1/1 Running 8 (22m ago) 5d1h
calico-system calico-typha-6c887b44f7-j8jjz 1/1 Running 12 (22m ago) 5d6h
ceph-csi csi-rbdplugin-provisioner-79745b68d5-9lrh9 7/7 Running 49 (22m ago) 4d
ceph-csi csi-rbdplugin-x7pgr 3/3 Running 18 (22m ago) 4d
gpu-operator gpu-feature-discovery-fvc98 1/1 Running 0 20m
gpu-operator gpu-operator-5cfd867fdd-9gczh 1/1 Running 0 21m
gpu-operator gpu-operator-node-feature-discovery-master-54788ff856-5jz25 1/1 Running 0 21m
gpu-operator gpu-operator-node-feature-discovery-worker-xq7dc 1/1 Running 0 21m
gpu-operator nvidia-cuda-validator-ck4f7 0/1 Completed 0 20m
gpu-operator nvidia-dcgm-exporter-kjqwh 1/1 Running 0 20m
gpu-operator nvidia-device-plugin-daemonset-vwxgt 0/1 CrashLoopBackOff 8 (4m28s ago) 20m
gpu-operator nvidia-mig-manager-5dz9k 1/1 Running 0 20m
gpu-operator nvidia-operator-validator-t4h4f 0/1 Init:CrashLoopBackOff 5 (2m11s ago) 20m
kube-system coredns-644b469dc9-vplt5 1/1 Running 10 (22m ago) 5d7h
kube-system coredns-644b469dc9-xrmh6 1/1 Running 9 (22m ago) 5d7h
kube-system etcd-***************************** 1/1 Running 10 (22m ago) 5d7h
kube-system haproxy-***************************** 1/1 Running 10 (22m ago) 5d7h
kube-system keepalived-***************************** 1/1 Running 10 (22m ago) 5d7h
kube-system kube-apiserver-***************************** 1/1 Running 6 (22m ago) 4d4h
kube-system kube-controller-manager-***************************** 1/1 Running 10 (22m ago) 5d7h
kube-system kube-proxy-8fp57 1/1 Running 10 (22m ago) 5d7h
kube-system kube-scheduler-***************************** 1/1 Running 10 (22m ago) 5d7h
kubernetes-dashboard dashboard-metrics-scraper-55db86c456-jdp9t 1/1 Running 7 (22m ago) 4d4h
kubernetes-dashboard kubernetes-dashboard-7fbd9df566-jx26n 1/1 Running 7 (22m ago) 4d4h
metallb-system controller-6f5956cc85-bbpgx 1/1 Running 7 (22m ago) 4d8h
metallb-system speaker-975bw 1/1 Running 12 (22m ago) 4d8h
rook-ceph csi-cephfsplugin-k8z6t 2/2 Running 14 (22m ago) 4d23h
rook-ceph csi-cephfsplugin-provisioner-765c68d589-2jf82 5/5 Running 45 (22m ago) 4d23h
rook-ceph csi-rbdplugin-provisioner-77ddd55848-vw86z 5/5 Running 35 (22m ago) 4d23h
rook-ceph csi-rbdplugin-z7m7b 2/2 Running 14 (22m ago) 4d23h
rook-ceph rook-ceph-crashcollector-*****************************-746x8mq7 1/1 Running 8 (22m ago) 4d23h
rook-ceph rook-ceph-mds-*************************-644fb89f98-ktncf 2/2 Running 18 (22m ago) 4d23h
rook-ceph rook-ceph-mds-*************************-59cbdfc785-5b4th 0/2 Pending 0 4d23h
rook-ceph rook-ceph-mgr-a-696dfdbc85-mgkpb 2/2 Running 14 (22m ago) 4d23h
rook-ceph rook-ceph-mon-a-58d567dd4-mmrr7 2/2 Running 18 (22m ago) 4d23h
rook-ceph rook-ceph-operator-6fc456bfb5-bnvkq 1/1 Running 8 (22m ago) 5d
rook-ceph rook-ceph-osd-0-6b69bbbb47-ppj9z 2/2 Running 16 (22m ago) 4d23h
rook-ceph rook-ceph-osd-prepare-*****************************-4jw49 0/1 Completed 0 20m
rook-ceph rook-ceph-tools-7bc98fdd5f-mv4p6 1/1 Running 7 (22m ago) 4d23h
tigera-operator tigera-operator-858497bcb6-pjkcl 1/1 Running 12 (22m ago) 5d6h
- kubernetes daemonset status:
kubectl get ds --all-namespaces
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
calico-system calico-node 1 1 1 1 1 kubernetes.io/os=linux 5d6h
calico-system csi-node-driver 0 0 0 0 0 kubernetes.io/os=linux 5d6h
ceph-csi csi-rbdplugin 1 1 1 1 1 <none> 4d
gpu-operator gpu-feature-discovery 1 1 1 1 1 nvidia.com/gpu.deploy.gpu-feature-discovery=true 21m
gpu-operator gpu-operator-node-feature-discovery-worker 1 1 1 1 1 <none> 21m
gpu-operator nvidia-dcgm-exporter 1 1 1 1 1 nvidia.com/gpu.deploy.dcgm-exporter=true 21m
gpu-operator nvidia-device-plugin-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.device-plugin=true 21m
gpu-operator nvidia-mig-manager 1 1 1 1 1 nvidia.com/gpu.deploy.mig-manager=true 21m
gpu-operator nvidia-operator-validator 1 1 0 1 0 nvidia.com/gpu.deploy.operator-validator=true 21m
kube-system kube-proxy 1 1 1 1 1 kubernetes.io/os=linux 5d7h
metallb-system speaker 1 1 1 1 1 kubernetes.io/os=linux 4d8h
rook-ceph csi-cephfsplugin 1 1 1 1 1 <none> 4d23h
rook-ceph csi-rbdplugin 1 1 1 1 1 <none> 4d23h
- If a pod/ds is in an error state or pending state
kubectl describe pod -n NAMESPACE POD_NAME
$ kubectl --kubeconfig $KUBECONFIG describe pod -n gpu-operator nvidia-device-plugin-daemonset
Name: nvidia-device-plugin-daemonset-vwxgt
Namespace: gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Node: *****************************/10.0.0.1
Start Time: Mon, 14 Nov 2022 15:14:09 +0100
Labels: app=nvidia-device-plugin-daemonset
controller-revision-hash=5c6f9f9597
pod-template-generation=1
Annotations: cni.projectcalico.org/containerID: 5c7d60815e6e752de8c738098630613efbd25781971c93ecbc77432857c64599
cni.projectcalico.org/podIP: 192.168.214.45/32
cni.projectcalico.org/podIPs: 192.168.214.45/32
Status: Running
IP: 192.168.214.45
IPs:
IP: 192.168.214.45
Controlled By: DaemonSet/nvidia-device-plugin-daemonset
Init Containers:
toolkit-validation:
Container ID: containerd://2f670c9ca8c0e9631e6bb144ea246f8caa61546f249c4d91dfc2d11a17518006
Image: **********************************/library/nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0
Image ID: **********************************/library/nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:90fd8bb01d8089f900d35a699e0137599ac9de9f37e374eeb702fc90314af5bf
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 14 Nov 2022 15:14:15 +0100
Finished: Mon, 14 Nov 2022 15:14:41 +0100
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/run/nvidia from run-nvidia (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7s76p (ro)
Containers:
nvidia-device-plugin:
Container ID: containerd://74b772357c5fab89ff79c751c47ed19d143e40ccdfaf12eab090ed145fa9342b
Image: **********************************/library/nvcr.io/nvidia/k8s-device-plugin:v0.12.3-ubi8
Image ID: **********************************/library/nvcr.io/nvidia/k8s-device-plugin@sha256:a9c2cba87729fe625f647d8000b354aecf209b6e139f171d49ed06ff09f3c24a
Port: <none>
Host Port: <none>
Command:
bash
-c
Args:
[[ -f /run/nvidia/validations/host-driver-ready ]] && driver_root=/ || driver_root=/run/nvidia/driver; export NVIDIA_DRIVER_ROOT=$driver_root; exec nvidia-device-plugin;
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Mon, 14 Nov 2022 15:35:46 +0100
Finished: Mon, 14 Nov 2022 15:35:46 +0100
Ready: False
Restart Count: 9
Environment:
PASS_DEVICE_SPECS: true
FAIL_ON_INIT_ERROR: true
DEVICE_LIST_STRATEGY: envvar
DEVICE_ID_STRATEGY: uuid
NVIDIA_VISIBLE_DEVICES: all
NVIDIA_DRIVER_CAPABILITIES: all
MIG_STRATEGY: single
NVIDIA_MIG_MONITOR_DEVICES: all
Mounts:
/run/nvidia from run-nvidia (rw)
/var/lib/kubelet/device-plugins from device-plugin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7s76p (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
run-nvidia:
Type: HostPath (bare host directory volume)
Path: /run/nvidia
HostPathType: Directory
kube-api-access-7s76p:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.device-plugin=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 23m default-scheduler Successfully assigned gpu-operator/nvidia-device-plugin-daemonset-vwxgt to *****************************
Warning FailedMount 23m kubelet MountVolume.SetUp failed for volume "run-nvidia" : hostPath type check failed: /run/nvidia is not a directory
Normal Pulled 23m kubelet Container image "**********************************/library/nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on
machine
Normal Created 23m kubelet Created container toolkit-validation
Normal Started 23m kubelet Started container toolkit-validation
Normal Pulled 21m (x5 over 23m) kubelet Container image "**********************************/library/nvcr.io/nvidia/k8s-device-plugin:v0.12.3-ubi8" already present on machine
Normal Created 21m (x5 over 23m) kubelet Created container nvidia-device-plugin
Normal Started 21m (x5 over 23m) kubelet Started container nvidia-device-plugin
Warning BackOff 3m12s (x92 over 23m) kubelet Back-off restarting failed container
- If a pod/ds is in an error state or pending state
kubectl logs -n NAMESPACE POD_NAME
$ kubectl --kubeconfig $KUBECONFIG logs -n gpu-operator nvidia-device-plugin-daemonset-vwxgt
2022/11/14 14:40:53 Starting FS watcher.
2022/11/14 14:40:53 Error: failed to create FS watcher: too many open files
- Output of running a container on the GPU machine: `docker run -it alpine echo foo`
  docker not installed
- Docker configuration file: `cat /etc/docker/daemon.json`
  docker not installed
- Docker runtime configuration: `docker info | grep runtime`
  docker not installed
- NVIDIA shared directory: `ls -la /run/nvidia`
$ ls -la /run/nvidia
total 0
drwxr-xr-x 4 root root 80 Nov 14 15:14 .
drwxr-xr-x 35 root root 1000 Nov 14 15:14 ..
drwxr-xr-x 2 root root 40 Nov 14 15:14 driver
drwxr-xr-x 2 root root 100 Nov 14 15:14 validations
- [x] NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`
$ ls -la /usr/local/nvidia/toolkit/
total 12920
drwxr-xr-x 3 root root 4096 Nov 14 14:49 .
drwxr-xr-x 3 root root 4096 Nov 14 14:49 ..
drwxr-xr-x 3 root root 4096 Nov 14 14:49 .config
lrwxrwxrwx 1 root root 32 Nov 14 14:49 libnvidia-container-go.so.1 -> libnvidia-container-go.so.1.11.0
-rw-r--r-- 1 root root 2959384 Nov 14 14:49 libnvidia-container-go.so.1.11.0
lrwxrwxrwx 1 root root 29 Nov 14 14:49 libnvidia-container.so.1 -> libnvidia-container.so.1.11.0
-rwxr-xr-x 1 root root 195856 Nov 14 14:49 libnvidia-container.so.1.11.0
-rwxr-xr-x 1 root root 154 Nov 14 14:49 nvidia-container-cli
-rwxr-xr-x 1 root root 47472 Nov 14 14:49 nvidia-container-cli.real
-rwxr-xr-x 1 root root 342 Nov 14 14:49 nvidia-container-runtime
-rwxr-xr-x 1 root root 350 Nov 14 14:49 nvidia-container-runtime-experimental
-rwxr-xr-x 1 root root 3771792 Nov 14 14:49 nvidia-container-runtime.experimental
-rwxr-xr-x 1 root root 203 Nov 14 14:49 nvidia-container-runtime-hook
-rwxr-xr-x 1 root root 2142088 Nov 14 14:49 nvidia-container-runtime-hook.real
-rwxr-xr-x 1 root root 4079040 Nov 14 14:49 nvidia-container-runtime.real
lrwxrwxrwx 1 root root 29 Nov 14 14:49 nvidia-container-toolkit -> nvidia-container-runtime-hook
- [x] NVIDIA driver directory: `ls -la /run/nvidia/driver`
$ ls -la /run/nvidia/driver/
total 0
drwxr-xr-x 2 root root 40 Nov 14 15:14 .
drwxr-xr-x 4 root root 80 Nov 14 15:14 ..
- [ ] kubelet logs `journalctl -u kubelet > kubelet.logs`
@EajksEajks Can you check whether the value of max_user_watches is set too low?
sysctl -a | grep fs.inotify.max_user_
For sure. Here it is:
fs.inotify.max_user_instances = 128
fs.inotify.max_user_watches = 8192
What would be considered too low?
Okay, I have no idea why I had to do the following, given that I installed only a minimal set of services on a clean vanilla K8s installation, itself running on a clean vanilla Ubuntu 20.04 installation, and that it's nowhere mentioned in the Kubernetes or GPU-operator installation instructions, but it solved my problem (I have no idea either what good or bad values would be).
So I modified my original sysctl config from
cat << 'EOF' | sudo tee /etc/sysctl.d/99-kubernetes-cri.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
to
cat << 'EOF' | sudo tee /etc/sysctl.d/99-kubernetes-cri.conf
fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 524288
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
Now it works.
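For completeness, a minimal sketch of applying and verifying the new limits without a reboot (assuming the file path above):
# Reload everything under /etc/sysctl.d (and /etc/sysctl.conf)
sudo sysctl --system
# Confirm the new limits are active
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches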
You can run sudo find /proc/*/fd -lname anon_inode:inotify | cut -d/ -f3 | xargs -I '{}' -- ps --no-headers -o '%p %U %c' -p '{}' | uniq -c | sort -nr
which will list each process along with how many inotify instances it has open.
On Ubuntu 22.04 the default is the following, so your earlier values do seem too low:
fs.inotify.max_user_watches = 122425
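Worth noting: the "failed to create FS watcher" error comes from creating a new inotify instance, so fs.inotify.max_user_instances (128 in the output above) is likely the limit actually being exhausted. A rough way to compare current usage against that limit (assuming inotify descriptors show up as anon_inode:inotify under /proc) is:
# Total inotify instances currently open on the node, to compare against the limit
sudo find /proc/*/fd -lname anon_inode:inotify 2>/dev/null | wc -l
sysctl fs.inotify.max_user_instances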
@EajksEajks @shivamerla
Would you please help me find which file I need to change the fs.inotify.max_ settings in, as I don't have a 99-kubernetes-cri.conf file in my environment?
I tried searching all the files under /etc/sysctl.d but could not find that string.
# Ubuntu VERSION="22.04.2
# I have docker running:
root@:/etc/sysctl.d# docker info | grep runtime
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
root@:/etc/sysctl.d# kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.9"
Quick fix, run these commands on the node:
sysctl -w fs.inotify.max_user_watches=100000
sysctl -w fs.inotify.max_user_instances=100000
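Those sysctl -w values are lost on reboot; to persist them, you can drop them into a file under /etc/sysctl.d and reload (the filename 99-inotify.conf here is just an example), e.g.:
cat << 'EOF' | sudo tee /etc/sysctl.d/99-inotify.conf
fs.inotify.max_user_instances = 100000
fs.inotify.max_user_watches = 100000
EOF
sudo sysctl --system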