Hashing zips with utf-8 filenames causes an error
ninjabear opened this issue · 1 comments
ninjabear commented
def get_zip_hash(obj):
    """Return a consistent hash of the content of a zip file ``obj``."""
    digest = hashlib.sha1()
    zfile = zipfile.ZipFile(obj, 'r')
    for path in sorted(zfile.namelist()):
        digest.update(six.text_type(path).encode('utf-8'))
        digest.update(zfile.read(path))
    return digest.hexdigest()
This function, taken from utils.py, raises an error when it encounters non-ASCII characters in a filename. Unfortunately, such characters occasionally appear in javac output (in my case, when using shapeless or scalaz).
Python's handling of character encoding is pretty hairy, but the basic problem is:
>>> x = '☹'
>>> print x
☹
>>> x.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
>>> type(x)
<type 'str'>
You can't call .encode('utf-8') on a byte string (str) that contains non-ASCII bytes: Python 2 first implicitly decodes it as ASCII, and that step fails. The call only works if the value is already a unicode string, like this:
>>> y = u'☹'
>>> print y
☹
>>> y.encode('utf-8')
'\xe2\x98\xb9'
>>> type(y)
<type 'unicode'>
However, since zip filenames aren't necessarily UTF-8 (jar files are, but other zips might not be), the fix should ideally be encoding agnostic.
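One way an encoding-agnostic version could look is sketched below. This is only an illustration, not the fix from the promised PR: the helper name `to_bytes` is hypothetical, and the idea is simply to hash filenames that are already byte strings as-is and to encode only real unicode text, so no implicit ASCII decode ever happens.

```python
import hashlib
import io
import zipfile


def to_bytes(name):
    """Hypothetical helper: return ``name`` as bytes without an ASCII decode."""
    if isinstance(name, bytes):
        # Python 2 ``str`` is already raw bytes; use them unchanged,
        # whatever encoding the zip author happened to use.
        return name
    # Unicode text (Python 2 ``unicode`` or Python 3 ``str``).
    return name.encode('utf-8')


def get_zip_hash(obj):
    """Return a consistent hash of the content of a zip file ``obj``."""
    digest = hashlib.sha1()
    with zipfile.ZipFile(obj, 'r') as zfile:
        # Sort on the byte form so mixed str/unicode names order consistently.
        for path in sorted(zfile.namelist(), key=to_bytes):
            digest.update(to_bytes(path))
            digest.update(zfile.read(path))
    return digest.hexdigest()


# Usage: hash an in-memory zip whose member name is non-ASCII.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr(u'\u2639.txt', b'hello')
print(get_zip_hash(buf))
```

Because the filename bytes are fed to the digest unchanged, the hash depends only on what is stored in the archive, not on guessing its encoding.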
I will submit a PR soon!
alexanderdean commented
Cheers @ninjabear !