peterbe/hashin

Concurrent pypi lookups with --update-all

peterbe opened this issue · 6 comments

If a requirements file has 10 packages, you have to do 10 pypi.org lookups, all in serial. When you use --update-all --interactive, that delay between each line is annoying.

I did a rough hack to get this to work and I'm just jotting down some notes.
I have a requirements file with 53 packages listed.
I ran this:

time python hashin.py --dry-run --update-all --include-prereleases -r ~/songsearch/requirements.txt

The whole thing took 1.92s.
Also, I put a little timer around each r = urlopen(url) call, dumped the timings to stdout and parsed the output. If you sum ALL the individual download times, it comes to 16.6 seconds.
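
The timing was nothing fancier than wrapping the urlopen call, something like this (a rough sketch from memory, not the actual diff; the print format and function name are made up):

import time
from urllib.request import urlopen  # urllib2.urlopen on Python 2.7

def timed_urlopen(url):
    # Time each individual download and dump it to stdout so the
    # per-package times can be parsed and summed afterwards.
    t0 = time.time()
    r = urlopen(url)
    print("DOWNLOAD {:.2f}s {}".format(time.time() - t0, url))
    return r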

The only bad thing so far is that there's an awkward little delay on the terminal whilst all this downloading is happening. You think nothing's happening. Like it's stuck. The verbose flag helps a little, but that's not on by default. If you use --interactive, that could be a nice place to inform about this.

Perhaps I'm over-worrying about the nothing-happens-till-all-is-downloaded part. I just tried another file and the WHOLE thing took just 2 seconds. That requirements file had 79 packages listed, and it took a total of 2 seconds to do 79 HTTP requests plus all the post-processing.

@mythmon @di What do you think about this? I haven't finished the work but it looks ^ promising. ~2 seconds to check 53 to 79 packages for updates. The core of it is this:

import concurrent.futures

# Requirement, _explode_package_spec and get_package_data are hashin's
# existing helpers; only the thread-pool plumbing here is new.


def pre_download_packages(memory, specs, verbose=False):
    futures = {}
    with concurrent.futures.ThreadPoolExecutor() as executor:
        for spec in specs:
            package, _, _ = _explode_package_spec(spec)
            req = Requirement(package)
            # One download per package, submitted to the pool; remember
            # which future belongs to which package name.
            futures[
                executor.submit(get_package_data, req.name, verbose=verbose)
            ] = req.name
        # As each download finishes, store its payload keyed by name.
        for future in concurrent.futures.as_completed(futures):
            content = future.result()
            memory[futures[future]] = content

It basically populates a dict with the downloaded content so that when it starts analyzing one package at a time, the download step can be skipped.
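
To illustrate the idea (a hypothetical sketch building on the snippet above; get_package_data_cached is a made-up name and the real wiring in hashin looks different):

memory = {}
pre_download_packages(memory, specs, verbose=verbose)

def get_package_data_cached(package, verbose=False):
    # Reuse the prefetched payload if we already have it; otherwise
    # fall back to the normal, blocking download.
    if package in memory:
        return memory[package]
    return get_package_data(package, verbose=verbose)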

By doing all the downloads first, it keeps the atomicity and predictability of the interactive prompt intact.

I think the idea of prefetching the needed requests in interactive mode makes sense. I have very little experience with the new asyncio parts of Python, but the code in your latest comment seems fine to me.

the new asyncio parts of Python

I certainly have experience with it, but saying I "get" it is like saying I "get" Linux.

The code I've got is not asyncio at all. Just good old regular threading.

I made it so that if you're on Python 2.7 you get the concurrent.futures backport from PyPI (the futures package). Untested.
And I also made it so you can deliberately avoid this threading stuff if you know you really can't use it. E.g. hashin --update-all --synchronous
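
Roughly what I mean (a minimal sketch; maybe_pre_download is a made-up name and the actual flag plumbing in hashin will differ):

try:
    import concurrent.futures
except ImportError:
    # Python 2.7 without the futures backport installed (pip install futures)
    concurrent = None

def maybe_pre_download(memory, specs, synchronous=False, verbose=False):
    # Skip the thread pool entirely if --synchronous was passed or the
    # backport isn't available; downloads then happen lazily, in serial.
    if synchronous or concurrent is None:
        return
    pre_download_packages(memory, specs, verbose=verbose)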

What I like about this is that it works in Python 2.7 and 3 without any third-party libraries (except the backport for 2.7) and it's simple. It only parallelizes the download piece, which is the only part that can be significantly sped up since it's network I/O.

I tested the error handling by messing with the spelling of a line in a requirements file (e.g. requestsXXX=2.20.1) and it immediately raised a nice exception and cleaned up the other threads.
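
That behaviour comes from concurrent.futures itself: future.result() re-raises the worker's exception in the main thread, and leaving the with block waits for the remaining workers. A tiny standalone demo (not hashin code):

import concurrent.futures

def fake_fetch(name):
    if name == "requestsXXX":
        raise ValueError("no such package: " + name)
    return name

futures = {}
with concurrent.futures.ThreadPoolExecutor() as executor:
    for name in ("requests", "requestsXXX", "packaging"):
        futures[executor.submit(fake_fetch, name)] = name
    for future in concurrent.futures.as_completed(futures):
        # result() re-raises the ValueError here; the executor's exit
        # handler then waits for the other workers before it propagates.
        print(futures[future], future.result())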

A caveat, of course, is that the whole run is now basically at the mercy of the slowest download, since we wait for ALL downloads to complete. Also, since it's threads, there's a small chance you saturate your network, but since the individual network calls are tiny I'm not sure that's even a problem.