tefra/xsdata

Calls to Ruff format/check fail on Windows for large document sets

SrirachaHorse opened this issue · 3 comments

Large XSD schemas that are split into multiple schemas or documents often produce a large number of corresponding dataclass files, particularly with configuration options such as namespace-clusters. In such cases, DataclassGenerator.ruff_code() is called on all of these files at once.

On Windows, a command line is limited to approximately 32k characters. If passing a large number of files to Ruff via subprocess.run() exceeds this limit, dataclass generation fails with a Windows error:

FileNotFoundError: [WinError 206] The filename or extension is too long

None of the input filenames individually exceeds the Windows MAX_PATH limit, so the problem is the number of files passed in a single invocation, not any particular file.
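A rough estimate shows how ordinary paths can add up past the limit. This is only an illustrative sketch: the path pattern and the file count are invented, the length ignores any quoting that subprocess may add, and 32,767 is the documented maximum for the command line passed to CreateProcess.

```python
# Documented upper bound for a command line passed to CreateProcess on Windows.
WINDOWS_CMDLINE_LIMIT = 32_767
# Classic Windows MAX_PATH value, for comparison with each individual path.
MAX_PATH = 260


def estimated_cmdline_length(args):
    """Approximate the command-line length as the arguments joined by spaces."""
    return len(" ".join(args))


# Hypothetical generator output: 900 modules, each with a short path.
file_names = [f"C:\\project\\generated\\ns_{i:04d}\\models.py" for i in range(900)]
command = ["ruff", "format", *file_names]

longest_path = max(len(name) for name in file_names)
total_length = estimated_cmdline_length(command)

print("longest single path:", longest_path, "(MAX_PATH is", MAX_PATH, ")")
print("whole command line:", total_length, "(limit is", WINDOWS_CMDLINE_LIMIT, ")")
```

Every path here is far below MAX_PATH, yet the combined command line is over the limit.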

The regression appears to come from #1043, the change that introduced the "all files at once" approach in place of running Ruff on each file separately.

@tefra I've considered a few solutions to this problem and would like to discuss a possible fix so that I can contribute. A few ideas:

  1. On Windows (or on every platform), have the ruff_code() method run Ruff on smaller batches of filenames rather than on every file at once (run Ruff on the first 50 files, then the next 50, etc.)
  2. Update the DataclassGenerator.render() method to build the file_names list from module directories rather than from individual files. This could be done by appending (e.g.) package_path.parent to the list instead of the package_path filename. Ruff formats and checks recursively when given a directory, so it would no longer need a large list of files.
    • This may have an adverse effect on package structures where module directories already contain user-created Python files that should not be modified by Ruff. With this fix, those files would also be formatted and checked by Ruff.
  3. Revert the change in #1043 and continue to run Ruff on each file individually.
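
As a rough illustration of option 1, batching could look like the sketch below. The names batched and run_in_batches are hypothetical and are not part of xsdata; the batch size of 50 is just the figure mentioned above, and the runner parameter exists only so the logic can be exercised without invoking Ruff.

```python
import subprocess
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def run_in_batches(
    base_command: Sequence[str],
    file_names: Sequence[str],
    size: int = 50,
    runner: Callable[..., object] = subprocess.run,
) -> None:
    """Run the command once per batch, appending that batch's file names."""
    for batch in batched(file_names, size):
        # check=True makes subprocess.run raise if the command exits non-zero.
        runner([*base_command, *batch], check=True)
```

A call such as run_in_batches(["ruff", "format"], file_names) would then invoke Ruff once per 50 files. A fixed count does not guarantee staying under the limit when paths are long, so a real fix might size batches by total command length instead.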

One step forward, two steps back 🤦

The reason I went with specific filenames is that people often add their own code to the output folder, which isn't great practice in my opinion.

The first option with the batches could work for everyone, but it would be tricky to test. I'm wondering if we should go the hard way: run Ruff on the entire output package, add a warning in the docs telling people not to add code to the output package, and be done with it.

That's my current inclination.
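
For illustration only, one way to reduce a list of generated files to the directories handed to Ruff is sketched below. The helper output_roots is hypothetical and not part of xsdata; it assumes the generated file paths are known and relies on Ruff recursing into a directory argument. As noted above, any user-written files in those directories would be formatted too.

```python
from pathlib import Path
from typing import Iterable, List


def output_roots(file_names: Iterable[str]) -> List[Path]:
    """Reduce file paths to the top-most directories that contain them."""
    # Sorting places a directory before any of its subdirectories.
    parents = sorted({Path(name).parent for name in file_names})
    roots: List[Path] = []
    for directory in parents:
        # Skip directories nested under one already selected, because
        # Ruff walks subdirectories on its own.
        if not any(root in directory.parents for root in roots):
            roots.append(directory)
    return roots


def directory_command(base_command: List[str], file_names: Iterable[str]) -> List[str]:
    """Build a command that passes directories instead of individual files."""
    return [*base_command, *(str(root) for root in output_roots(file_names))]
```

With this, the command length depends on the number of distinct top-level output directories rather than on the number of generated files.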

That sounds good to me, @tefra. Running on the whole output package is my preferred solution too. I'll work on a fix.