Create fsnotify watcher: too many open files
Jeffwan opened this issue · 7 comments
/mnt/test-data-volume/kubeflow-kfserving-presubmit-e2e-1047-1a2fe6f-5248-8696/src/kubeflow/testing is at d4394ea
failed to create fsnotify watcher: too many open files
Reported by @yuzisun
Issue-Label Bot is automatically applying the labels:
Label | Probability |
---|---|
kind/bug | 0.94 |
area/testing | 0.62 |
File descriptors on the target node have been exhausted. The better way to handle this is to ssh to the node and kill the processes holding many descriptors. The other, easier way is to remove the node from the cluster; I think the GCP node group will add a new one back.
This should not reproduce every time. Once the new test is scheduled on another node, it should pass.
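A minimal sketch of both remedies, assuming SSH access to the node for the first and kubectl access for the second; the node name is taken from the debug session below and is only illustrative, and the drain/delete step assumes the GCP node group will recreate the node:

```sh
# On the node: count inotify instances per PID to find the worst offenders
# before killing them (inotify fds appear as symlinks to anon_inode:inotify).
find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null \
  | cut -d/ -f3 | sort | uniq -c | sort -nr | head

# Alternative remedy: remove the node and let the node group replace it.
kubectl cordon debug-worker-0
kubectl drain debug-worker-0 --ignore-daemonsets
kubectl delete node debug-worker-0
```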
Issue-Label Bot is automatically applying the labels:
Label | Probability |
---|---|
area/engprod | 0.54 |
Containers should inherit the host settings. I think the ci-team may not have node permissions. Checking the sysctl values, we get the following:
root@debug-worker-0:/mnt/test-data-volume# sysctl fs.inotify
fs.inotify.max_queued_events = 16384
fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 12288
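Since the ci-team may not have node permissions, one way to confirm that containers really do see the host limits is to read them from a throwaway pod pinned to the node. This is a sketch, assuming kubectl access to the cluster; the pod name and image are arbitrary:

```sh
# Run a one-off pod on the suspect node and read the inherited inotify limits.
kubectl run inotify-check --rm -it --restart=Never --image=busybox \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"debug-worker-0"}}' \
  -- sh -c 'cat /proc/sys/fs/inotify/max_user_instances \
            /proc/sys/fs/inotify/max_user_watches \
            /proc/sys/fs/inotify/max_queued_events'
```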
There are only a few pods running:
kubeflow-test-infra kubeflow-kfserving-presubmit-e2e-1047-0894df4-5200-82e4-1022901040 0 (0%) 0 (0%) 0 (0%) 0 (0%) 11h
kubeflow-test-infra kubeflow-kfserving-presubmit-e2e-1047-1a2fe6f-8800-5024-4068610039 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5h39m
Some workloads seemed to be running fine a few minutes ago:
kubeflow-periodic-1-0-kfctl-upgrade-2273-0716-388587593 0/2 Completed 0 35m
kubeflow-periodic-1-0-kfctl-upgrade-2273-0716-3898022393 0/2 Completed 0 63m
kubeflow-periodic-1-0-kfctl-upgrade-2273-0716-395719757 0/2 Completed 0 35m
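For reference, listings like the two above can be reproduced with kubectl; a sketch, reusing the node and namespace names that appear earlier in this thread:

```sh
# Resource requests of pods scheduled on the node (the "Non-terminated Pods" table).
kubectl describe node debug-worker-0 | grep -A 20 'Non-terminated Pods'

# Recent test workloads in the test-infra namespace, oldest first.
kubectl get pods -n kubeflow-test-infra --sort-by=.metadata.creationTimestamp
```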
@yuzisun Does the KFServing test itself have any changes?
@Jeffwan It is possible that some of my previous changes in PR kserve/kserve#1047 caused the test volume issue. I'm not sure, but when I try other existing KFServing PRs now, they hit the same issue.