piskvorky/smart_open

http module - incorrect reading gzip compressed stream

grubberr opened this issue · 3 comments

Hello,

import smart_open

url = "https://fonts.googleapis.com/css?family=Montserrat"
headers = {"Accept-encoding": "deflate, gzip"}

result = smart_open.open(url, transport_params={"headers": headers}, mode="rb")
buff = result.read()  # single read() -> returns the raw (compressed) bytes
print(len(buff))

result = smart_open.open(url, transport_params={"headers": headers}, mode="rb")
buff = result.read(2)  # partial read first...
buff += result.read()  # ...then the rest -> returns decompressed bytes
print(len(buff))

196
209

196 bytes - gzip compressed result
209 bytes - uncompressed result

This happens because:
in the first case, the library uses self.response.raw.read(), which returns the payload exactly as the server sent it, i.e. still gzip-compressed;
in the second case, the library uses self.response.iter_content, whose output the requests library has already decompressed.
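The size discrepancy can be illustrated locally without the network: gzip-compressing a payload shows how the wire byte count (what raw.read() sees) differs from the decompressed byte count (what iter_content yields). This is a sketch with a made-up payload; the 196/209 figures above depend on the actual server response.

```python
import gzip

# Stand-in for the CSS served by the URL above (hypothetical content).
css = b"@font-face { font-family: 'Montserrat'; font-style: normal; }\n" * 4

compressed = gzip.compress(css)

# response.raw.read() would yield the compressed wire bytes;
# response.iter_content() yields the decompressed payload.
print("wire bytes (compressed):", len(compressed))
print("payload bytes (decompressed):", len(gzip.decompress(compressed)))
```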

Versions

print(platform.platform())
Linux-5.14.0-1047-oem-x86_64-with-glibc2.31
print("Python", sys.version)
Python 3.9.11 (main, Aug  9 2022, 09:22:28) 
[GCC 9.4.0]
print("smart_open", smart_open.__version__)
smart_open 6.0.0

Checklist

Before you create the issue, please make sure you have:

  • Described the problem clearly
  • Provided a minimal reproducible example, including any required data
  • Provided the version numbers of the relevant software

What is the desired behavior here?

Honestly, that's a good question.
I was just pointing out the inconsistency.

Came across this while trying to solve a problem using smart_open to read from a range of different URLs.
My code:

with (
    so.open(source, 'rb', transport_params={'headers': HEADERS}) as fin,
    so.open(destination, 'wb') as fout
):
    fout.write(fin.read())

I observed that for some URLs I was able to get a meaningful output file while in other cases it was just gibberish. Comparing between success and failure I determined that the ones that were failing were those with Content-Encoding: gzip in the response headers.

@grubberr your issue helped pinpoint what was going on; changing my code to the following now works for all URLs:

with (
    so.open(source, 'rb', transport_params={'headers': HEADERS}) as fin,
    so.open(destination, 'wb') as fout
):
    while True:
        chunk = fin.read(1024)
        if not chunk:
            break

        fout.write(chunk)
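For what it's worth, the standard library already packages this read-in-chunks loop: shutil.copyfileobj copies between any two file-like objects in fixed-size chunks. A small sketch (plain BytesIO objects here, but any smart_open file object should work the same way):

```python
import io
import shutil

def copy_stream(fin, fout, chunk_size=1024):
    """Copy fin to fout in fixed-size chunks, like the loop above."""
    shutil.copyfileobj(fin, fout, length=chunk_size)

src = io.BytesIO(b"x" * 5000)
dst = io.BytesIO()
copy_stream(src, dst)
print(len(dst.getvalue()))  # 5000
```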

I understand smart_open uses the file extension to determine compression. My failing URL is 'https://www.BCBSIL.com/aca-json/il/index_il.json', so I guess smart_open can't tell it needs gzip to decompress. I tried passing compression='.gz' when opening the file, but it raised the following error.

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Users\theog\AppData\Local\Programs\Python\Python39\lib\gzip.py", line 300, in read
    return self._buffer.read(size)
  File "C:\Users\theog\AppData\Local\Programs\Python\Python39\lib\gzip.py", line 487, in read
    if not self._read_gzip_header():
  File "C:\Users\theog\AppData\Local\Programs\Python\Python39\lib\gzip.py", line 435, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'{\n')

This really puzzled me for a while, but @grubberr 's explanation of result.read() vs result.read(2) helps explain it. It looks like gzip reads in chunks (fourth line of the stack trace), so even though the original content is compressed, gzip receives content already decompressed by requests, which causes it to raise the error.
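The exact exception can be reproduced with the standard library alone: feeding uncompressed JSON bytes to gzip fails the magic-number check, because the first two bytes ('{' plus a newline) are not the gzip magic b'\x1f\x8b'. A minimal sketch (the JSON body here is made up):

```python
import gzip
import io

# Uncompressed JSON bytes, as requests would deliver them after
# transparently decoding Content-Encoding: gzip.
data = b'{\n  "Plans": []\n}'

try:
    gzip.GzipFile(fileobj=io.BytesIO(data)).read(2)
except gzip.BadGzipFile as exc:
    print(exc)  # Not a gzipped file (b'{\n')
```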

What is the desired behavior here?

  • I think ideally it would be that the gzip decompression is done transparently for f.read() as it is for f.read(n).
  • If that's not possible or too complex, having the difference in behavior clarified in the documentation would probably be useful for other people running into the same problem, who can then implement slightly different code similar to what I've done.
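Until either happens, one application-level workaround is to sniff the magic bytes and wrap the stream in gzip.GzipFile only when needed. This is a hedged sketch, not smart_open's internal API; it assumes the file object is seekable (buffer a network stream in BytesIO first):

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"

def maybe_decompress(fileobj):
    """Wrap fileobj in GzipFile if it starts with the gzip magic bytes."""
    head = fileobj.read(2)
    fileobj.seek(0)
    if head == GZIP_MAGIC:
        return gzip.GzipFile(fileobj=fileobj)
    return fileobj

plain = io.BytesIO(b'{"ok": true}')
packed = io.BytesIO(gzip.compress(b'{"ok": true}'))
print(maybe_decompress(plain).read())   # b'{"ok": true}'
print(maybe_decompress(packed).read())  # b'{"ok": true}'
```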

Now that I know what the issue is and how to work around it, this is by no means a showstopper. I do want to say that smart_open has really made my life much simpler, I appreciate all the work that has gone into this library!