btubbs/sseclient

Massively broken with gzip encoded streams

Count-Count opened this issue · 8 comments

Due to the changes made for the short-read functionality, using streams that are gzip-encoded is massively broken. Accessing the raw response content bypasses gzip decoding, so the event stream cannot be read.

This sometimes happens for Wikimedia event streams (gzip encoding is not used for every response; I'm not sure when it is used).

See https://requests.readthedocs.io/en/latest/user/quickstart/#raw-response-content
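A minimal sketch of the failure mode (the URL is just the Wikimedia recentchange stream as an example, and whether a given response is actually compressed varies, per the above): reading from `response.raw` yields the bytes as they came off the wire, while `iter_content()` would decode them.

```python
# Sketch of the failure mode; URL and header are illustrative.
import requests

response = requests.get(
    'https://stream.wikimedia.org/v2/stream/recentchange',
    stream=True,
    headers={'Accept': 'text/event-stream'},
)
print(response.headers.get('Content-Encoding'))  # 'gzip' when compressed

# requests leaves content decoding off for response.raw, so this returns
# the on-wire (compressed) bytes. May block until 1024 bytes arrive.
chunk = response.raw.read(1024)
print(chunk[:2] == b'\x1f\x8b')  # True for a gzip stream: not parseable as SSE
```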

Related bug: #27
I'm not really sure why it was closed, as it seems to be the exact same problem mentioned here.

There are a few different possible approaches to fix this:

  1. Disable short reads when gzip encoding is used, as you've done in #37. The downside of this is that #8 and #9 will resurface for users who are also using gzip encoding.
  2. Disable gzip encoding by overriding the Accept-Encoding header that requests sets automatically, as mentioned in #27 (see the sketch after this list). The downside of this is that we lose the benefit of gzip compression.
  3. Fix short reads so they also work with gzipped content.
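For reference, approach 2 would amount to something like the following sketch, which forces an identity encoding so the raw stream is plain SSE text (again, the URL is illustrative):

```python
# Sketch of approach 2: ask the server not to compress at all.
# 'identity' overrides the 'gzip, deflate' value requests adds by default.
import requests

response = requests.get(
    'https://stream.wikimedia.org/v2/stream/recentchange',  # illustrative URL
    stream=True,
    headers={
        'Accept': 'text/event-stream',
        'Accept-Encoding': 'identity',
    },
)
# response.raw now carries uncompressed bytes, so short reads work unchanged.
```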

I will see if I can come up with a pull request that takes the third approach. I was not aware that requests even supported gzip encoding.

Approach no. 3 sounds good. Shouldn't it just work(tm) if we use the high-level `iter_content()` with `chunk_size=None`? `stream` is already set to `True`.

```python
def iter_content(self, chunk_size=1, decode_unicode=False):
    """Iterates over the response data.  When stream=True is set on the
    request, this avoids reading the content at once into memory for
    large responses.  The chunk size is the number of bytes it should
    read into memory.  This is not necessarily the length of each item
    returned as decoding can take place.

    chunk_size must be of type int or None. A value of None will
    function differently depending on the value of `stream`.
    stream=True will read data as it arrives in whatever size the
    chunks are received. If stream=False, data is returned as
    a single chunk.

    If decode_unicode is True, content will be decoded using the best
    available encoding based on the response.
    """
```

The documentation makes it sound like it should, but unfortunately that's not the case. If you trace things back through urllib3's underlying stream and read functions, and then through http.client.HTTPResponse.read, you ultimately end up at a call to io.BufferedReader.read, which per the Python docs will block until EOF. Setting chunk_size=None therefore means you will receive no events until EOF.
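One possible direction for approach 3, sketched below (this is a guess, not necessarily what the fix will look like): urllib3's `read()` accepts a `decode_content` argument, so short reads can ask urllib3 to decompress each chunk as it is read. The endpoint and helper name here are illustrative.

```python
# Sketch: short reads that still decode gzip.
import requests

# Illustrative endpoint; substitute any SSE URL.
response = requests.get('https://example.org/stream', stream=True)

def read_decoded(resp, chunk_size=1024):
    """Yield decoded chunks without waiting for EOF."""
    while True:
        # amt keeps reads short; decode_content=True makes urllib3
        # run each chunk through its gzip/deflate decoder on the way out.
        chunk = resp.raw.read(chunk_size, decode_content=True)
        if not chunk:
            break
        yield chunk
```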

@mutantmonkey What was the problem with a chunk_size of 1?

Using a chunk size of one causes unnecessarily high CPU usage, because each byte has to be processed by Python code as it is received instead of being added to a buffer and processed in bulk. This library used to read byte by byte, but 6820dc8 changed that behavior.
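Not from the thread, but a rough local illustration of that per-call overhead: draining the same in-memory buffer one byte at a time versus in 8 KiB chunks.

```python
# Rough illustration of per-read-call overhead; numbers vary by machine.
import io
import timeit

data = b'x' * (1 << 20)  # 1 MiB

def drain(chunk_size):
    buf = io.BytesIO(data)
    while buf.read(chunk_size):
        pass

print('1-byte reads: %.3fs' % timeit.timeit(lambda: drain(1), number=10))
print('8 KiB reads:  %.3fs' % timeit.timeit(lambda: drain(8192), number=10))
```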

In any case, I believe I may have a fix that will work, but I need an endpoint with gzip enabled to test against. If you happen to have one handy, please share it; otherwise I can try to set something up, but it will take another couple of days.
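In case it helps, a throwaway local endpoint along these lines should serve a gzip-encoded SSE stream for testing (a sketch using only the standard library; the handler name, port, and event contents are arbitrary). Flushing the compressor after each event pushes data to the client instead of buffering it until the connection closes.

```python
# Minimal gzip-encoded SSE test server (sketch, standard library only).
import time
import zlib
from http.server import BaseHTTPRequestHandler, HTTPServer

class GzipSSEHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/event-stream')
        self.send_header('Content-Encoding', 'gzip')
        self.end_headers()
        # wbits=31 selects the gzip container; Z_SYNC_FLUSH after each
        # event emits the compressed bytes immediately.
        compressor = zlib.compressobj(wbits=31)
        for i in range(10):
            event = ('data: event %d\n\n' % i).encode()
            self.wfile.write(compressor.compress(event))
            self.wfile.write(compressor.flush(zlib.Z_SYNC_FLUSH))
            self.wfile.flush()
            time.sleep(1)
        self.wfile.write(compressor.flush())

HTTPServer(('localhost', 8000), GzipSSEHandler).serve_forever()
```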