Multiprocessing vulture with Python
Hi!
Is it possible to run vulture with multiprocessing?
If so, do you have any recommendations on how to do it?
Trying to split the files across different processes/threads breaks the tool, so it would be cool if it were possible to share/split the AST between processes.
I'm using Python's threading and queue libs.
BR
Per
Vulture currently doesn't support parallel processing. And I feel that if it's slow enough to make somebody wish for parallel processing, we should be making its single-core execution faster first. Is the code you're testing open source? If so, can you share a link and show how you call Vulture? How long does running Vulture on your code take?
FYI, some time ago I contributed something similar for the autoflake CLI tool: PyCQA/autoflake#107.
I tried to take a look at vulture's source code. As it currently stands, it's difficult to integrate multiprocessing into it, because the AST parsing logic is intertwined with the Vulture class.
Vulture.scan() (the method which should be parallelized), and the other methods which depend on it, should be turned into a function or staticmethod, independent from the Vulture class. It should accept one argument (either a path or code), traverse the entire AST and return a "final result" (a Python dict). You want to use standard Python types (dict, str, int, etc.) for both the function argument and return value, because they need to be serialized by multiprocessing.
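For concreteness, here's a rough sketch of the shape such a refactor could take. The scan_file() name, the result dict keys, and the Pool usage below are hypothetical, not Vulture's actual API:

```python
import ast
import multiprocessing


def scan_file(path):
    """Hypothetical standalone worker, independent of the Vulture class.

    Takes a plain str path and returns a plain dict, so both the argument
    and the return value can be pickled by multiprocessing.
    """
    with open(path, "rb") as f:
        tree = ast.parse(f.read(), filename=path)
    # Stand-in for the real analysis: Vulture would walk the tree and
    # collect unused names here instead of just counting nodes.
    return {"path": path, "num_nodes": sum(1 for _ in ast.walk(tree))}


if __name__ == "__main__":
    paths = ["a.py", "b.py", "c.py"]  # example inputs
    with multiprocessing.Pool() as pool:
        results = pool.map(scan_file, paths)
    # Merge the per-file dicts into one report in the main process.
    for result in results:
        print(result)
```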
Likewise, the argparse namespace should be turned into a dict for the same reason.
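For the options, something like vars() would do (again just a sketch; the flag below is only an example):

```python
import argparse

# Hypothetical option; vulture's real flags may differ.
parser = argparse.ArgumentParser()
parser.add_argument("--min-confidence", type=int, default=0)
args = parser.parse_args([])

options = vars(args)  # plain dict, e.g. {"min_confidence": 0}, easy to pass to workers
```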
In summary: adding multiprocessing is the easy part. The hard part is refactoring the Vulture class first. =)
@gtkacz wrote:
Just to add to this, multiprocessing is only efficient in Python when it's I/O bound because of the GIL.
It's the other way around actually. You want to use threading when the work is I/O bound, and multiprocessing when it's CPU bound. This sort of work is mostly CPU bound. Yes, vulture reads multiple files from disk, but that's fast already (.py files are small, disk read()s are cached, etc.). What's slow here is parsing the AST trees. That's the typical sort of work you want to split across multiple CPUs. The speedup will be massive.