Parallel find_packages_allowing_duplicates?
mikepurvis opened this issue · 4 comments
For large workspaces (600+ packages), the serialized parsing of package.xml files in find_packages_allowing_duplicates can start to be non-trivial (catkin_pkg/src/catkin_pkg/packages.py, lines 109 to 110 at a4cb118).
Without changing the interface of the function, would we consider allowing this work to be spread over multiple threads or processes, possibly triggered by a threshold on the number of packages?
Absolutely! The caller shouldn't care how the requested information is being gathered. If that loop can be parallelized, that would be great.
A naive threading implementation is slower than the simple loop, so the work isn't IO-bound (the GIL keeps the XML parsing from running concurrently across threads). I get ~1.5s with the current implementation and <0.5s running it with a multiprocessing map. I'll send a PR shortly.
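For reference, a rough sketch of how such a comparison could be reproduced (the workspace path and pool size below are placeholders, not the actual setup measured above; multiprocessing.dummy provides a thread-backed Pool with the same API as the process-backed one):

```python
# Rough benchmark sketch, not the actual measurement: times the serial loop
# against a thread pool and a process pool for parsing package.xml files.
import os
import time
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool  # same API, backed by threads

from catkin_pkg.package import parse_package
from catkin_pkg.packages import find_package_paths


def timed(label, fn):
    start = time.time()
    fn()
    print('%s: %.2fs' % (label, time.time() - start))


if __name__ == '__main__':
    basepath = '/path/to/workspace/src'  # placeholder for a 600+ package workspace
    paths = [os.path.join(basepath, p) for p in find_package_paths(basepath)]

    timed('serial', lambda: [parse_package(p) for p in paths])
    with ThreadPool(4) as pool:
        timed('threads', lambda: pool.map(parse_package, paths))
    with Pool(4) as pool:
        timed('processes', lambda: pool.map(parse_package, paths))
```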
The simplest implementation is like so:
```python
# Inside find_packages_allowing_duplicates(); needs `import multiprocessing` and `import os`.
package_paths = find_package_paths(basepath, exclude_paths=exclude_paths, exclude_subspaces=exclude_subspaces)
# find_package_paths returns paths relative to basepath, so join before parsing.
parsed_packages = multiprocessing.Pool(4).map(parse_package, [os.path.join(basepath, p) for p in package_paths])
return dict(zip(package_paths, parsed_packages))
```
However, preserving the behaviour of the warnings argument requires passing something extra through the map and then manually aggregating the results in the parent process, which unfortunately necessitates some additional wrapping.
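A minimal sketch of that wrapping might look like the following (illustrative only, not the implementation merged in #171; the helper name _parse_with_warnings is made up, and the pool size is arbitrary):

```python
import multiprocessing
import os

from catkin_pkg.package import parse_package
from catkin_pkg.packages import find_package_paths


def _parse_with_warnings(path):
    # Hypothetical helper; must live at module level so it can be pickled for the pool.
    # Each worker collects its own warnings and ships them back to the parent.
    local_warnings = []
    package = parse_package(path, warnings=local_warnings)
    return package, local_warnings


def find_packages_allowing_duplicates(basepath, exclude_paths=None, exclude_subspaces=False, warnings=None):
    package_paths = find_package_paths(basepath, exclude_paths=exclude_paths, exclude_subspaces=exclude_subspaces)
    xml_paths = [os.path.join(basepath, path) for path in package_paths]
    # A package-count threshold (as floated above) could decide whether to use a pool at all.
    with multiprocessing.Pool(4) as pool:
        results = pool.map(_parse_with_warnings, xml_paths)
    packages = {}
    for path, (package, local_warnings) in zip(package_paths, results):
        packages[path] = package
        if warnings is not None:
            warnings.extend(local_warnings)
    return packages
```

Returning the per-package warning lists from the workers is what lets the parent replay them into the caller-supplied list, since state mutated in a child process isn't visible to the parent.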
Addressed by #171.