http module - incorrect reading gzip compressed stream
grubberr opened this issue · 3 comments
Hello,

```python
import smart_open

url = "https://fonts.googleapis.com/css?family=Montserrat"
headers = {"Accept-encoding": "deflate, gzip"}

# Case 1: a single read() call
result = smart_open.open(url, transport_params={"headers": headers}, mode="rb")
buff = result.read()
print(len(buff))  # 196

# Case 2: a partial read followed by a read of the rest
result = smart_open.open(url, transport_params={"headers": headers}, mode="rb")
buff = result.read(2)
buff += result.read()
print(len(buff))  # 209
```

196 bytes - the gzip-compressed response
209 bytes - the uncompressed response
This happens because:
- in the first case, the library uses `self.response.raw.read()`, which returns the response body as-is from the server, i.e. still gzip-compressed;
- in the second case, the library uses `self.response.iter_content`, whose result has already been decompressed by the `requests` library.
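The size difference between the two code paths can be illustrated without any network access, using only the standard library. This is a minimal sketch; the byte string below is a made-up stand-in for the CSS payload, not the server's actual response:

```python
import gzip

# Made-up stand-in for the CSS body; the real 209-byte payload is not
# reproduced here. The point is that the raw stream and its decompressed
# form have different lengths, matching the 196 vs 209 observation.
body = b"@font-face { font-family: 'Montserrat'; }" * 5

compressed = gzip.compress(body)            # akin to self.response.raw.read()
decompressed = gzip.decompress(compressed)  # akin to what iter_content hands back

print(len(compressed), len(decompressed))
assert decompressed == body
```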
Versions
```python
>>> print(platform.platform())
Linux-5.14.0-1047-oem-x86_64-with-glibc2.31
>>> print("Python", sys.version)
Python 3.9.11 (main, Aug 9 2022, 09:22:28)
[GCC 9.4.0]
>>> print("smart_open", smart_open.__version__)
smart_open 6.0.0
```
Checklist
Before you create the issue, please make sure you have:
- Described the problem clearly
- Provided a minimal reproducible example, including any required data
- Provided the version numbers of the relevant software
What is the desired behavior here?
Honestly, that's a good question. I was just pointing out the inconsistency.
Came across this while trying to solve a problem using smart_open
to read from a range of different URLs.
My code:

```python
with (
    so.open(source, 'rb', transport_params={'headers': HEADERS}) as fin,
    so.open(destination, 'wb') as fout
):
    fout.write(fin.read())
```
I observed that for some URLs I got a meaningful output file, while in other cases it was just gibberish. Comparing the successes and failures, I determined that the failing URLs were the ones with `Content-Encoding: gzip` in the response headers.
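One way to predict which URLs will hit this is to inspect the `Content-Encoding` response header before copying. Here is a sketch with a hypothetical helper (not part of smart_open or requests), which only assumes a plain headers mapping:

```python
def is_gzip_encoded(headers):
    """Return True if a headers mapping declares gzip content encoding.

    Hypothetical helper: normalises key case itself, so it works with a
    plain dict as well as a case-insensitive requests headers object.
    """
    value = ""
    for key, val in headers.items():
        if key.lower() == "content-encoding":
            value = val
            break
    return "gzip" in value.lower()

print(is_gzip_encoded({"Content-Encoding": "gzip"}))   # True
print(is_gzip_encoded({"Content-Type": "text/css"}))   # False
```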
@grubberr your issue helped pinpoint what was going on; changing my code to the following now works for all URLs:
```python
with (
    so.open(source, 'rb', transport_params={'headers': HEADERS}) as fin,
    so.open(destination, 'wb') as fout
):
    while True:
        chunk = fin.read(1024)
        if not chunk:
            break
        fout.write(chunk)
```
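Incidentally, the chunked copy loop is essentially what the standard library's `shutil.copyfileobj` does, so it can serve as a drop-in replacement. A sketch, with `io.BytesIO` standing in for the smart_open file objects:

```python
import io
import shutil

fin = io.BytesIO(b"hello " * 1000)  # stand-in for the source stream
fout = io.BytesIO()                 # stand-in for the destination

# Reads and writes in fixed-size chunks, like the while-loop above
shutil.copyfileobj(fin, fout, length=1024)

assert fout.getvalue() == b"hello " * 1000
```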
I understand smart_open uses the file extension to determine compression. My failing URL is 'https://www.BCBSIL.com/aca-json/il/index_il.json', so I guess smart_open can't tell that it needs gzip decompression. I tried passing `compression='.gz'` when opening the file, but it gave me the following error.
```
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Users\theog\AppData\Local\Programs\Python\Python39\lib\gzip.py", line 300, in read
    return self._buffer.read(size)
  File "C:\Users\theog\AppData\Local\Programs\Python\Python39\lib\gzip.py", line 487, in read
    if not self._read_gzip_header():
  File "C:\Users\theog\AppData\Local\Programs\Python\Python39\lib\gzip.py", line 435, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'{\n')
```
This really puzzled me for a while, but @grubberr's explanation of `result.read()` vs `result.read(2)` helps explain it. It looks like gzip reads in chunks (see the fourth line of the stack trace), so even though the original content is compressed, gzip receives the content already decompressed by `requests`, which causes it to raise the error.
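The `BadGzipFile` error can be reproduced with the standard library alone: gzip rejects any stream that does not begin with the two-byte magic number `\x1f\x8b`. A minimal sketch, where the JSON-ish bytes are a stand-in for the already-decompressed body from the traceback:

```python
import gzip
import io

# Plain JSON-ish bytes, starting with b'{\n' as in the traceback
plain = b'{\n  "key": "value"\n}\n'

try:
    gzip.GzipFile(fileobj=io.BytesIO(plain)).read()
except gzip.BadGzipFile as exc:
    print(exc)  # Not a gzipped file (b'{\n')
```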
What is the desired behavior here?
- I think ideally gzip decompression would be done transparently for `f.read()`, just as it is for `f.read(n)`.
- If that's not possible or too complex, clarifying the difference in behavior in the documentation would probably help other people running into the same problem; they could then implement slightly different code similar to what I've done.
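As a more robust workaround than hard-coding `compression='.gz'`, the caller can sniff the gzip magic number on the first two bytes and decompress only when it is present. This is a sketch of a hypothetical helper, not part of smart_open's API, with `io.BytesIO` standing in for the HTTP stream:

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"

def read_maybe_gzipped(fileobj):
    """Read all bytes, decompressing transparently when gzip-compressed.

    Hypothetical helper; checks the two-byte gzip magic number rather
    than trusting the URL's file extension.
    """
    data = fileobj.read()
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data

# Handles both compressed and plain payloads
payload = b'{"key": "value"}'
assert read_maybe_gzipped(io.BytesIO(gzip.compress(payload))) == payload
assert read_maybe_gzipped(io.BytesIO(payload)) == payload
```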
Now that I know what the issue is and how to work around it, this is by no means a showstopper. I do want to say that smart_open has really made my life much simpler; I appreciate all the work that has gone into this library!