Calls to Ruff format/check fail on Windows for large document sets
SrirachaHorse opened this issue · 3 comments
For sufficiently large XSD schemas that are split into multiple schemas or documents, generation often produces a large number of corresponding dataclass files, particularly when using configuration options such as `namespace-clusters`. In such cases, `DataclassGenerator.ruff_code()` is called on all of these files at once.
On Windows, the command line is limited to approximately 32k characters. If passing a large number of files to Ruff via `subprocess.run()` exceeds this limit, Windows raises an error and dataclass generation fails:
FileNotFoundError: [WinError 206] The filename or extension is too long
None of the input filenames individually exceed the Windows MAX_PATH limit, so this is an issue with the number of files attempting to be processed at once, rather than an issue with a particular file.
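To make the failure mode concrete, here is a rough sketch that estimates how long the command line becomes for a given argument list. It assumes the 32,767-character limit of the Windows `CreateProcess` call and relies on `subprocess.list2cmdline`, a helper CPython uses internally to build the Windows command string (it is not a documented public API). The file paths are made up for illustration.

```python
import subprocess
from collections.abc import Sequence

# Maximum command-line length accepted by CreateProcess on Windows.
WINDOWS_COMMAND_LINE_LIMIT = 32_767


def command_line_length(args: Sequence[str]) -> int:
    """Return the length of the command string built from args."""
    # list2cmdline joins the arguments with spaces and quotes them as needed.
    return len(subprocess.list2cmdline(args))


def exceeds_windows_limit(args: Sequence[str]) -> bool:
    """Report whether the command would be too long for Windows."""
    return command_line_length(args) > WINDOWS_COMMAND_LINE_LIMIT


# Hypothetical generated files: each path is short, but there are many of them.
generated = [
    rf"C:\project\generated\models\module_{i:04d}.py" for i in range(2000)
]
too_long = exceeds_windows_limit(["ruff", "format", *generated])
```

Each path here is far below `MAX_PATH`, yet the combined command is well over the limit, which matches the behaviour described above.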
The regression appears to be a result of #1043, as that was the change that introduced the "all files at once" approach instead of running Ruff on each file separately.
@tefra I've considered a few solutions to this problem, so I would like to discuss a possible fix so that I can contribute. A few ideas I've thought of:
- If using Windows (or just in general), let the `ruff_code()` method run Ruff with smaller batches of filenames, rather than trying to run on every file at once (run Ruff on the first 50 files, then the next 50, etc.).
- Update the `DataclassGenerator.render()` method to build the `file_names` list from module names, rather than from individual files. This could be done by appending (e.g.) `package_path.parent` to the list instead of the `package_path` filename. Ruff will run formatting/checking recursively when given a directory, rather than needing to run from a large list of files.
  - This may have an adverse effect on certain package structures where there are already user-created Python files in module directories that should not be modified by Ruff. This fix would cause those files to now be formatted and checked by Ruff.
- Revert the change in #1043 and continue to run Ruff on each file individually.
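The batching idea from the first option could look roughly like the sketch below. It is not the project's implementation: `batch_file_names`, `run_ruff_in_batches` and the 30,000-character budget are hypothetical names and values chosen for illustration. The sketch batches by estimated command length rather than by a fixed count of 50, so long paths do not push a batch over the limit.

```python
import subprocess
from collections.abc import Iterator, Sequence

# Assumed budget, kept below the ~32k Windows limit to leave some headroom.
MAX_COMMAND_CHARS = 30_000


def batch_file_names(
    file_names: Sequence[str],
    base_command: Sequence[str],
    max_chars: int = MAX_COMMAND_CHARS,
) -> Iterator[list[str]]:
    """Yield groups of file names whose estimated command stays under max_chars."""
    # Fixed part of the command, counting one separator per argument.
    base_length = sum(len(arg) + 1 for arg in base_command)
    batch: list[str] = []
    length = base_length
    for name in file_names:
        # +3 allows for a separating space and a pair of quotes around the path.
        cost = len(name) + 3
        if batch and length + cost > max_chars:
            yield batch
            batch, length = [], base_length
        batch.append(name)
        length += cost
    if batch:
        yield batch


def run_ruff_in_batches(file_names: Sequence[str]) -> None:
    """Run `ruff format` once per batch instead of once for all files."""
    base_command = ["ruff", "format"]
    for batch in batch_file_names(file_names, base_command):
        subprocess.run([*base_command, *batch], check=True)
```

A file name that is longer than the budget on its own still ends up in a batch by itself, so nothing is silently dropped.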
One step forward, two steps back 🤦
The reason I went with specific filenames is that people often add their own code to the output folder, which isn't a great practice in my opinion.
The first option with the batches could work for everyone, but it would be tricky to test. I am wondering if we should go the hard way and run ruff on the entire output package, add a warning in the docs to prohibit people from adding code in the output package and be done with it.
That's my current inclination.
That sounds good to me @tefra, I think running on the whole output package is my preferred solution too. I'll work on a fix.
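A minimal sketch of the directory-based approach, under the assumption that the generator already has the list of generated file paths. `collect_package_dirs` is a hypothetical helper, not part of the code base. It reduces the file list to the directories Ruff needs to be given, and skips directories that are nested inside another collected directory, since Ruff recurses into them anyway.

```python
from collections.abc import Iterable
from pathlib import Path


def collect_package_dirs(file_paths: Iterable[Path]) -> list[Path]:
    """Return the top-most parent directories of file_paths, in first-seen order."""
    # A dict keeps insertion order and removes duplicate directories.
    parents = list(dict.fromkeys(path.parent for path in file_paths))
    unique = set(parents)
    # Drop any directory that has an ancestor in the collected set.
    return [
        directory
        for directory in parents
        if not any(ancestor in unique for ancestor in directory.parents)
    ]


# Hypothetical generated files spread over two packages.
generated = [
    Path("out/models/a.py"),
    Path("out/models/b.py"),
    Path("out/models/sub/c.py"),
    Path("out/other/d.py"),
]
package_dirs = collect_package_dirs(generated)
```

The resulting command grows with the number of packages rather than the number of files, at the cost noted in the thread: any user-written Python files inside those directories are formatted and checked as well.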