jendrikseipp/vulture

Multiprocessing vulture with Python.


Hi!

Is it possible to run vulture with multiprocessing?
If so, do you have any recommendations on how to do it?

Trying to split the files across different processes/threads breaks the tool, so it would be cool if it's possible to share/split the AST between processes.
I'm using Python's threading and queue libs.

BR
Per

Vulture currently doesn't support parallel processing. And I feel that if it's slow enough to make somebody wish for parallel processing, we should make its single-core execution faster first. Is the code you're testing open source? If so, can you share a link and how you call Vulture? How long does running Vulture on your code take?

Just to add to this, multiprocessing is only efficient in Python when it's I/O bound because of the GIL. So until something like PEP 734 or PEP 684 makes it into Python, I'm not sure this is feasible.

FYI, some time ago I contributed something similar for the autoflake CLI tool: PyCQA/autoflake#107.

I took a look at the vulture source code. As it currently stands, it's difficult to integrate multiprocessing into it, because the AST parsing logic is intertwined with the Vulture class.

Vulture.scan() (the method that should be parallelized), and the other methods that depend on it, should be turned into a function or staticmethod that is independent of the Vulture class. It should accept one argument (either a path or code), traverse the entire AST and return a "final result" (a Python dict). You want to use standard Python types (dict, str, int, etc.) for both the function argument and the return value, because they need to be serialized by multiprocessing.
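A minimal sketch of what that could look like (the function name scan_file and the dict layout are made up for illustration, not Vulture's actual API):

```python
import ast
import multiprocessing
import sys


def scan_file(path):
    """Standalone scan: take a path, return a plain dict.

    Only built-in types cross the process boundary, so multiprocessing
    can pickle both the argument and the result.
    """
    with open(path, "rb") as f:
        tree = ast.parse(f.read(), filename=path)
    # Simplified traversal: collect defined function names.
    defined_funcs = [
        node.name for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    ]
    return {"path": path, "defined_funcs": defined_funcs}


if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        results = pool.map(scan_file, sys.argv[1:])
    # Merge the per-file dicts back into one report afterwards.
    for result in results:
        print(result)
```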

Likewise, the argparse namespace should be turned into a dict for the same reason.
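For example, vars() turns a namespace into a plain dict (the --min-confidence option below is just an illustration):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--min-confidence", type=int, default=0)
args = parser.parse_args(["--min-confidence", "60"])

# A plain dict that multiprocessing can pickle; workers receive this
# instead of the argparse.Namespace.
config = vars(args)  # {'min_confidence': 60}
```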

In summary: adding multiprocessing is the easy part. The hard part is refactoring the Vulture class first. =)

@gtkacz wrote:

> Just to add to this, multiprocessing is only efficient in Python when it's I/O bound because of the GIL.

It's the other way around actually. You want to use threading when the work is I/O bound, and multiprocessing when it's CPU bound. This sort of work is mostly CPU bound. Yes, vulture reads multiple files from disk, but that's fast already (.py files are small, disk read()s are cached, etc.). What's slow here is parsing the AST trees. That's the typical sort of work you want to split across multiple CPUs. The speedup will be massive.