python-jsonschema/check-jsonschema

Opening concurrent file handles for all instancefiles risks breaching OS limit.

Closed this issue · 6 comments

The use of the BinaryIO type for instancefiles causes file handles to be opened concurrently by Click for all instance files supplied at the command line.

A user may, for example, run a glob expansion on a number of targets that exceeds the limit of open files they are permitted by their operating system. Currently check-jsonschema fails in this case with OSError: [Errno 24] Too many open files. It would be nice if files were opened sequentially so that the user could supply a far greater number of arguments to a single command without error.

A minimal case for simulating this on Linux is as follows:

# Set open files limit to 1024 (this is a typical default on Linux,
#   but we do it explicitly to make the test repeatable):
ulimit -n 1024
mkdir /tmp/c-js
echo '{}' > /tmp/c-js/schema.js
for i in $(seq -f "%04g" 1024); do echo '{}' > /tmp/c-js/instance.$i.json; done;

check-jsonschema --schemafile /tmp/c-js/schema.js /tmp/c-js/instance.0001.json
# ok -- validation done

check-jsonschema --schemafile /tmp/c-js/schema.js /tmp/c-js/instance.*.json
# OSError: [Errno 24] Too many open files

Could it be possible to type the instancefiles argument as a str or Path and only open the file handles one by one as needed?

Thanks for the report/request!

It's certainly possible to do -- it is how earlier versions worked. I switched to the built-in click behavior when adding support for stdin, as it seemed like a nice simplification.

Before I pursue this, a quick question: are you running into this issue today, do you foresee running into it (soon), or is the issue purely theoretical?
This may be relevant for prioritization, and I may be pressed on it if I look to make any upstream contributions to click in support of it.

I'll want to look into click's capabilities to see if it can open the files lazily before pursuing a change here.

Thanks for the quick response. The issue already occurs for me on a project where I'm globbing around 2,000 files. However, it's simple to work around as users can normally raise the ulimit value to anything they like (eg ulimit -n 4096), so it's more a matter of convenience and tidy resource management than a critical issue.

I checked and click does have a lazy mode for file objects, but it rigs them up to close as part of the command exiting. I don't think that using it naively will solve this.

I'll have this near the front of my queue when I'm next able to devote some time to this project (likely in a little over a week).

Minor update:

I've been tinkering on this but it's not working quite as smoothly as I wanted. The click lazy file object does an eager open/close to catch some errors (e.g. permissions) early, which messes with some of the fifos in the testsuite.

It works in a live context, when I use a shell to write a fifo, so I must be doing something slightly wrong in the tests. I might also file a click bug report, since I think the open/close behavior is slightly unsafe for special file types.

I'm going to try to figure out how to fix the fifo tests, but I do have some options for handling this all in the worst case.

Tested and working perfectly. Thanks very much!

Wow, thanks for testing before I even cut the release! I was meaning to push this out yesterday but got a bit delayed.
It's great to hear that it's working for you as intended.

I've just dropped v0.27.3 with this included -- it's the primary update in there. Let me know if you see any issues with the released version. 😄