antonagestam/collectfast

S3 strategy inconsistent hash for gzipped files


Hi,

For some of the gzipped files, the local hash is different from the one calculated by AWS S3 (the ETag).
I worked out that it is due to the following line (in strategies/base.py):

zf = gzip.GzipFile(mode="wb", compresslevel=6, fileobj=buffer, mtime=0.0)
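For reference, the comparison that fails is between this locally computed digest and the object's S3 ETag, which for non-multipart uploads is the MD5 of the stored bytes. A hedged sketch of that check (bucket and key are hypothetical; this is not collectfast's actual implementation):

    import hashlib

    import boto3

    s3 = boto3.client("s3")

    def etag_matches(bucket, key, local_bytes):
        """Compare a local MD5 digest with the S3 ETag (valid for non-multipart uploads)."""
        remote_etag = s3.head_object(Bucket=bucket, Key=key)["ETag"].strip('"')
        return hashlib.md5(local_bytes).hexdigest() == remote_etag

    # If local_bytes was gzipped at level 6 but the stored object at level 9,
    # the digests differ and the unchanged file gets re-uploaded anyway.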

There is a mismatch between compression levels (potentially due to a boto3 change), because the current GzipFile default is 9, and the following code is used in django-storages (1.8.0) to upload a compressed file to the S3 bucket:

    def _compress_content(self, content):
        """Gzip a given string content."""
        content.seek(0)
        zbuf = io.BytesIO()
        #  The GZIP header has a modification time attribute (see http://www.zlib.org/rfc-gzip.html)
        #  This means each time a file is compressed it changes even if the other contents don't change
        #  For S3 this defeats detection of changes using MD5 sums on gzipped files
        #  Fixing the mtime at 0.0 at compression time avoids this problem
        zfile = GzipFile(mode='wb', fileobj=zbuf, mtime=0.0)
        try:
            zfile.write(force_bytes(content.read()))
        finally:
            zfile.close()
        zbuf.seek(0)
        # Boto 2 returned the InMemoryUploadedFile with the file pointer replaced,
        # but Boto 3 seems to have issues with that. No need for fp.name in Boto3
        # so just returning the BytesIO directly
        return zbuf
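Side by side, the two code paths compress the same content at different levels, which generally produces different bytes and therefore different MD5 digests. A minimal, self-contained sketch of the effect (the file contents are made up):

    import gzip
    import hashlib
    import io

    def gzip_md5(data, **gzip_kwargs):
        """Gzip data in memory and return the MD5 hex digest of the result."""
        buffer = io.BytesIO()
        with gzip.GzipFile(mode="wb", fileobj=buffer, mtime=0.0, **gzip_kwargs) as zf:
            zf.write(data)
        return hashlib.md5(buffer.getvalue()).hexdigest()

    content = b"body { margin: 0 auto; color: #333; }\n" * 1024  # hypothetical static file

    local = gzip_md5(content, compresslevel=6)  # what collectfast hashes locally
    remote = gzip_md5(content)                  # what django-storages uploads (default level, 9)
    print(local, remote)  # for typical assets these differ, so unchanged files are re-uploaded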

So if I remove the compression level entirely, it works as expected:
zf = gzip.GzipFile(mode="wb", fileobj=buffer, mtime=0.0)
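For reference, omitting the argument falls back to the GzipFile default, which is 9 and can be checked directly:

    import gzip
    import inspect

    # Default compression level used when compresslevel is omitted,
    # i.e. the level django-storages effectively uploads with.
    default = inspect.signature(gzip.GzipFile.__init__).parameters["compresslevel"].default
    print(default)  # 9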

Would it be possible to adjust it?

I've created a PR, though I'm not 100% sure whether it will affect the other strategies (Google, Azure, etc.). Please review.