MacPython/terryfy

guard against incomplete downloads?

Opened this issue · 6 comments

ev-br commented

When used for downloading the wheels built via MacPython/scipy-wheels, if

$ python terryfy/wheelhouse-uploader -n ...

hits e.g. a network timeout, it leaves behind an incomplete wheel. (No idea what would happen without the -n switch; would it upload a broken wheel to PyPI?)

An issue here is that a user (an RM for some package, presumably) does not have an easy way of checking whether a downloaded wheel is OK or not. This is not a hypothetical scenario: I hit it when trying to do a release from a place with flaky internet. These issues were discussed in this thread, https://mail.scipy.org/pipermail/scipy-dev/2016-June/021384.html, from the security angle, but here I think the failure mode is not malice, just network timeouts.

ISTM a way to guard against these failures is to checksum the wheels on the build farm, upload the checksums to the Rackspace container, and have a way of checking them in terryfy/wheelhouse-uploader, either as part of normal operation or as a special action.

Yes, that's a good idea. How about an extra step in the build scripts to calculate the sha256 checksums before upload, writing a file like numpy-1.11.1-cp27-cp27mu-manylinux1_x86_64.whl.sha256, one per wheel, and then having wheelhouse-uploader look for the matching sha256 file and check against it?
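For concreteness, a minimal sketch of what that build-farm step could look like (a hypothetical helper, not something multibuild currently does):

```python
# Hypothetical build-farm step: write a companion <wheel>.sha256 file
# next to every built wheel, in "digest  filename" format so that
# `shasum -a 256 -c` can verify it later.
import hashlib
from pathlib import Path

def write_sha256_files(wheelhouse="wheelhouse"):
    for wheel in sorted(Path(wheelhouse).glob("*.whl")):
        digest = hashlib.sha256(wheel.read_bytes()).hexdigest()
        checksum_file = wheel.with_name(wheel.name + ".sha256")
        checksum_file.write_text("%s  %s\n" % (digest, wheel.name))

if __name__ == "__main__":
    write_sha256_files()
```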

ev-br commented

Yup, that'd be perfect!

@ogrisel - looking at it - maybe wheelhouse-uploader is the right place to do the checksum calculation and upload? What do you think about an optional flag like --hashfile to generate (probably) MD5 sum files? I'm thinking MD5 because of course the shasum doesn't protect you from compromise of the upload site, but its presence might imply that it does.

wheelhouse-uploader already computes sha256 hexdigests and puts them both in a custom json file and in the href URLs of the generated html page, so that pip actually checks them automatically and catches corrupted downloads.
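For reference, the mechanism works roughly like this: each link on the generated index page carries a #sha256= URL fragment, and pip re-hashes the downloaded file and aborts on a mismatch. A sketch (the exact layout wheelhouse-uploader produces may differ, and the example digest is made up):

```python
# Illustrative sketch of an index link with an embedded sha256 fragment.
import hashlib
from pathlib import Path

def index_href(wheel_path):
    wheel = Path(wheel_path)
    digest = hashlib.sha256(wheel.read_bytes()).hexdigest()
    return "%s#sha256=%s" % (wheel.name, digest)

# -> e.g. "numpy-1.11.1-cp27-cp27mu-manylinux1_x86_64.whl#sha256=9f1e..."
```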

Just to check that I understand: wheelhouse-uploader does this (index / json hashes) unless you use the --no-update-index flag. Multibuild does actually use this flag, partly because I didn't want to change the form of the directory listing of the pre-existing upload directory, in case anyone was depending on it, and partly because I was worried that two simultaneous uploads would trample on each other.

It's nice that pip will check the hashes embedded in the html page, but that only happens when installing directly from that URL as an index; here we're worrying about downloads to a local machine before upload to PyPI. Does the wheelhouse-uploader fetch command check the hashes?

What do you think about an option for individual hash files for the no-index case?

Does the wheelhouse-uploader fetch command check the hashes?

Indeed, it does not, but that is a good idea.

What do you think about an option for individual hash files for the no-index case?

I am fine with that as well. It's probably a safer way to deal with the eventual-consistency semantics of most cloud blob stores.
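A sketch of what the fetch-side check discussed above could look like, assuming the one-.sha256-file-per-wheel layout proposed earlier (hypothetical helpers, not current wheelhouse-uploader behaviour):

```python
# Hypothetical fetch-side check: recompute each downloaded wheel's sha256
# and compare it against the companion <wheel>.sha256 file.
import hashlib
from pathlib import Path

def verify_wheel(wheel_path):
    wheel = Path(wheel_path)
    expected = wheel.with_name(wheel.name + ".sha256").read_text().split()[0]
    actual = hashlib.sha256(wheel.read_bytes()).hexdigest()
    if actual != expected:
        raise IOError("%s: checksum mismatch (incomplete download?)" % wheel.name)

def verify_wheelhouse(wheelhouse="wheelhouse"):
    # An RM could run this over the whole wheelhouse before uploading to PyPI.
    for wheel in sorted(Path(wheelhouse).glob("*.whl")):
        verify_wheel(wheel)
```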