
Create fsnotify watcher: too many open files

Jeffwan opened this issue · 7 comments

/mnt/test-data-volume/kubeflow-kfserving-presubmit-e2e-1047-1a2fe6f-5248-8696/src/kubeflow/testing is at d4394ea
failed to create fsnotify watcher: too many open files

Reported by @yuzisun

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/bug 0.94
area/testing 0.62

Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.

File descriptors on the target node have been exhausted. The better way to handle this is to SSH into the node and kill the process that holds too many descriptors. The other, easier way is to remove the node from the cluster; I think the GCP node group will then add a new one to the cluster.
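To find the process holding the descriptors once you are on the node, something like the following rough sketch might help. It walks `/proc/<pid>/fd` and counts open descriptors per process, and separately counts those that point at `anon_inode:inotify`, since creating an inotify instance can also fail with "too many open files" when the per-user instance limit is reached. The function names `count_descriptors` and `top_consumers` are made up for illustration, and the script assumes a Linux node and enough privileges (usually root) to read other processes' `fd` directories.

```python
import os


def count_descriptors(proc_root="/proc"):
    """Return {pid: (total_fds, inotify_fds)} for every readable process."""
    usage = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/sys
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            names = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        total = 0
        inotify = 0
        for name in names:
            try:
                target = os.readlink(os.path.join(fd_dir, name))
            except OSError:
                continue  # descriptor was closed while we were scanning
            total += 1
            if target == "anon_inode:inotify":
                inotify += 1  # this descriptor is an inotify instance
        usage[int(entry)] = (total, inotify)
    return usage


def top_consumers(usage, n=5):
    """Return the n processes with the most open descriptors, highest first."""
    return sorted(usage.items(), key=lambda kv: kv[1][0], reverse=True)[:n]


# Example use on the node (assumption: run as root):
#   for pid, (total, inotify) in top_consumers(count_descriptors()):
#       print(f"pid={pid} fds={total} inotify_instances={inotify}")
```

The `proc_root` parameter is only there so the scan can be pointed at a different directory for testing; on a real node the default is what you want.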

This should not reproduce every time. Once a new test is scheduled on another node, it should pass.

Issue-Label Bot is automatically applying the labels:

Label Probability
area/engprod 0.54


The container should inherit the host settings. I think ci-team may not have node permissions. Checking the sysctl values, we get the following:

root@debug-worker-0:/mnt/test-data-volume# sysctl fs.inotify
fs.inotify.max_queued_events = 16384
fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 12288
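If `sysctl` is not available in the container, the same three values can be read directly from `/proc/sys/fs/inotify`. A minimal sketch, assuming a Linux host; `read_inotify_limits` is a hypothetical helper name, not part of any tool mentioned in this thread:

```python
import os


def read_inotify_limits(base="/proc/sys/fs/inotify"):
    """Return the fs.inotify.* limits as a dict of integers."""
    limits = {}
    for name in ("max_queued_events", "max_user_instances", "max_user_watches"):
        # Each sysctl is exposed as a small text file holding one integer.
        with open(os.path.join(base, name)) as f:
            limits[name] = int(f.read().strip())
    return limits


# Example use inside the container or on the node:
#   for key, value in read_inotify_limits().items():
#       print(f"fs.inotify.{key} = {value}")
```

Running it both inside the test container and on the node would show whether the container really sees the host settings.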

Only a few pods are running.

  kubeflow-test-infra         kubeflow-kfserving-presubmit-e2e-1047-0894df4-5200-82e4-1022901040    0 (0%)        0 (0%)       0 (0%)           0 (0%)         11h
  kubeflow-test-infra         kubeflow-kfserving-presubmit-e2e-1047-1a2fe6f-8800-5024-4068610039    0 (0%)        0 (0%)       0 (0%)           0 (0%)         5h39m

Some workloads seemed to be running fine a few minutes ago.

kubeflow-periodic-1-0-kfctl-upgrade-2273-0716-388587593                   0/2     Completed               0          35m
kubeflow-periodic-1-0-kfctl-upgrade-2273-0716-3898022393                  0/2     Completed               0          63m
kubeflow-periodic-1-0-kfctl-upgrade-2273-0716-395719757                   0/2     Completed               0          35m

@yuzisun Does the KFServing test itself have any changes?

@Jeffwan It is possible that one of my previous changes in PR kserve/kserve#1047 caused the test volume issue. I'm not sure, but when I now try other existing KFServing PRs, they hit the same issue.

Deleting the long-running kfserving-presubmit tests above helps. It seems those are the processes causing the issue. Thanks for the clues, @yuzisun. I will resolve this issue.
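For spotting such leftovers in a pod listing, a small sketch like this could flag pods whose AGE column exceeds a threshold. It assumes ages in the compact form shown above (`35m`, `5h39m`, `11h`) and handles only the `s`, `m`, `h` and `d` units. The helper names and the 2-hour default threshold are illustrative assumptions, not project policy.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
_AGE_PART = re.compile(r"(\d+)([smhd])")


def age_to_seconds(age):
    """Convert a compact age string such as '5h39m' to seconds."""
    parts = _AGE_PART.findall(age)
    # Reject anything that is not fully made of <number><unit> pairs.
    if not parts or "".join(num + unit for num, unit in parts) != age:
        raise ValueError(f"unrecognised age: {age!r}")
    return sum(int(num) * _UNIT_SECONDS[unit] for num, unit in parts)


def long_running(pods, threshold="2h"):
    """Return the names of pods whose age exceeds the threshold."""
    limit = age_to_seconds(threshold)
    return [name for name, age in pods if age_to_seconds(age) > limit]
```

Fed the `(name, age)` pairs from the listings above, it would return the two kfserving-presubmit pods (11h and 5h39m) and skip the periodic runs that are 35m and 63m old.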

Thanks @Jeffwan for the help!