S3 strategy inconsistent hash for gzipped files
Hi,
For some of the gzipped files, the local hash is different from the one calculated by AWS S3 (the ETag).
I worked out that it is due to the following line (in strategies/base.py):
zf = gzip.GzipFile(mode="wb", compresslevel=6, fileobj=buffer, mtime=0.0)
There is a mismatch between compression levels (potentially a boto3 change): the default compression level is currently 9, while the line above hard-codes 6. The following code is used in django-storages (1.8.0) to upload a compressed file to the S3 bucket:
def _compress_content(self, content):
    """Gzip a given string content."""
    content.seek(0)
    zbuf = io.BytesIO()
    # The GZIP header has a modification time attribute (see http://www.zlib.org/rfc-gzip.html)
    # This means each time a file is compressed it changes even if the other contents don't change
    # For S3 this defeats detection of changes using MD5 sums on gzipped files
    # Fixing the mtime at 0.0 at compression time avoids this problem
    zfile = GzipFile(mode='wb', fileobj=zbuf, mtime=0.0)
    try:
        zfile.write(force_bytes(content.read()))
    finally:
        zfile.close()
    zbuf.seek(0)
    # Boto 2 returned the InMemoryUploadedFile with the file pointer replaced,
    # but Boto 3 seems to have issues with that. No need for fp.name in Boto3
    # so just returning the BytesIO directly
    return zbuf
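To illustrate the effect, here is a minimal standalone sketch (not code from either project; the file name is just a placeholder) that gzips the same payload with the hard-coded level 6 and with the gzip default of 9, both with mtime fixed at 0, and prints the MD5 digests. For most payloads the two digests differ, which is exactly why the local hash no longer agrees with the S3 ETag (the plain MD5 of the uploaded bytes for non-multipart uploads):

import gzip
import hashlib
import io

def gzip_md5(payload, compresslevel=None):
    """Gzip the payload into memory (mtime fixed at 0) and return the MD5 hex digest."""
    buf = io.BytesIO()
    kwargs = {} if compresslevel is None else {"compresslevel": compresslevel}
    with gzip.GzipFile(mode="wb", fileobj=buf, mtime=0.0, **kwargs) as zf:
        zf.write(payload)
    return hashlib.md5(buf.getvalue()).hexdigest()

# Placeholder file, any reasonably compressible static asset will do.
with open("static/app.css", "rb") as fh:
    payload = fh.read()

# strategies/base.py side: compresslevel hard-coded to 6
print(gzip_md5(payload, compresslevel=6))
# django-storages side: compresslevel left at the gzip default (9)
print(gzip_md5(payload))
# For most inputs the two digests differ, so the local hash never matches the ETag.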
So if I remove the compression level entirely, it works as expected:
zf = gzip.GzipFile(mode="wb", fileobj=buffer, mtime=0.0)
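For reference, a rough way to sanity-check the fix (the bucket and key names below are placeholders, and the comparison only makes sense for non-multipart uploads, where the ETag is the plain MD5 of the stored bytes):

import gzip
import hashlib
import io

import boto3

def local_gzip_etag(payload):
    """MD5 of the payload gzipped with the default compresslevel and mtime=0,
    quoted the way S3 reports a non-multipart ETag."""
    buf = io.BytesIO()
    with gzip.GzipFile(mode="wb", fileobj=buf, mtime=0.0) as zf:
        zf.write(payload)
    return '"%s"' % hashlib.md5(buf.getvalue()).hexdigest()

s3 = boto3.client("s3")
# Placeholder bucket/key; django-storages keeps the original key and sets Content-Encoding: gzip.
remote_etag = s3.head_object(Bucket="my-bucket", Key="static/app.css")["ETag"]

with open("static/app.css", "rb") as fh:
    print(local_gzip_etag(fh.read()) == remote_etag)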
Would it be possible to adjust it?
I created a PR; however, I'm not 100% sure whether it will affect other strategies (Google, Azure, etc.). Please review.