elastic/elasticsearch-py

Regression in 6.4: Scroll fails with large scroll_id

ChrisPortman opened this issue · 18 comments

Changes to the scroll method in 6.4 submit the scroll_id as part of the URL. This causes:

elasticsearch.exceptions.RequestError: RequestError(400, 'too_long_frame_exception', 'An HTTP line is larger than 4096 bytes.')

when a large number of shards is involved, which produces a large scroll_id.

def scroll(self, scroll_id=None, body=None, params=None):
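
For anyone wanting to reproduce this, a minimal sketch follows; the index name, query, and connection details are illustrative assumptions, not taken from the original report:

    # Minimal reproduction sketch -- index name, query and connection details
    # are assumptions, not taken from the original report.
    from elasticsearch import Elasticsearch

    client = Elasticsearch()  # elasticsearch-py 6.4.x

    # An initial search against an index with many shards returns a long scroll_id.
    resp = client.search(index="logs-*", scroll="2m", body={"query": {"match_all": {}}})
    scroll_id = resp["_scroll_id"]

    # In 6.4 the client sends the scroll_id as a query parameter, so a long id
    # pushes the request line past http.max_initial_line_length (4kb by default)
    # and Elasticsearch responds with a 400 too_long_frame_exception.
    client.scroll(scroll_id=scroll_id, scroll="2m")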

I struggled with this same problem today. I found that I could resolve it by making some minor changes to the scroll method: using POST instead of GET and putting the scroll_id into the body.

https://github.com/elastic/elasticsearch-py/blob/master/elasticsearch/client/__init__.py#L1300

    @query_params("scroll", "rest_total_hits_as_int", "scroll_id")
    def scroll(self, body=None, params=None):
        """
        Scroll a search request created by specifying the scroll parameter.
        `<http://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html>`_

        :arg scroll_id: The scroll ID
        :arg body: The scroll ID if not passed by URL or query parameter.
        :arg scroll: Specify how long a consistent view of the index should be
            maintained for scrolled search
        :arg rest_total_hits_as_int: This parameter is used to restore the total hits as a number
            in the response. This param was added in version 6.x to handle mixed-cluster queries where
            nodes are on multiple versions (7.0 and 6.latest)
        """
        _body = f'"scroll_id": {body}'
        return self.transport.perform_request(
            "POST", "/_search/scroll", params=params, body=_body
        )

I'm sorry you're running into issues, but unfortunately (or fortunately, because it's a config change and not a code change/new release) this error has nothing to do with the Elasticsearch-py client, but with Elasticsearch itself.

The error you're getting is coming back directly from Elasticsearch.

For reference please see: elastic/elasticsearch#3210

In order to fix this you need to adjust some of the parameters in Elasticsearch itself.
For this particular error you need to change http.max_initial_line_length.

I hope this helps

Respectfully, I would like to disagree. Reverting to version 6.3 of the client, with no other changes to the server configuration, resolved my issue.

6.3.1 of the client would include the scroll_id in the body when no other body was provided. If a specific body was supplied, it would resort to putting the scroll_id in the query params.

6.4 skips attempting to put the scroll_id in the body and only uses the query params.

I agree with @ChrisPortman. This change caught me off guard when I upgraded my cluster from 6.x to 7.x last week. Given that the scroll_id length scales with the number of shards, my smaller test cluster did not run into the issue with long scroll IDs, and I ended up with an unexpectedly broken production app.

While increasing http.max_initial_line_length will likely solve this issue for most clusters, larger clusters will require ever larger values, and because the setting is not dynamic, changing it can be quite time-consuming, especially on large clusters. Reverting to passing scroll_id in the body instead of as a query param solves this problem more completely, in my opinion.

In the meantime, I am using a forked version of this package where I switched back to passing the scroll_id in the request body, which has solved this issue for me. I've opened a PR in #973 with that change.

One additional comment. Commit 4030400, created in December 2013, had the scroll method send a POST request to /_search/scroll. This behaviour was changed to an HTTP GET in commit 619e7eb, in June 2014.

Hi everyone - thank you very much for raising this issue. I wanted to dig into this change a bit to find out why it was introduced into the codebase in the first place, to make sure that reverting back to POST won't introduce problems for someone else. As @gmazzola noted, back in December 2013 the scroll method would send a POST request to the endpoint, and then in June 2014 this was changed to a GET method in order to enable method-based access control, as you can see from the commit message that @gmazzola linked above. The use case that this commit envisioned was a situation where someone is using a reverse proxy to secure a read-only cluster. In this scenario, the reverse proxy filters out all requests except GET requests, so in order to allow folks with this kind of architecture to read data using the scroll endpoint, the method was altered to use GET instead of POST.

I need to clarify that this issue is not about POST vs GET. The breaking change is between two different approaches to performing the GET.

This issue is caused by a change to the client library code in version 6.4, where the behaviour of the scroll method changed such that it no longer sends the scroll ID in the GET body, only in the query params.

This is the 6.3 version of the method:

def scroll(self, scroll_id=None, body=None, params=None):

This is the 6.4 version:

def scroll(self, scroll_id=None, body=None, params=None):

You can see that in 6.3, it first tries to use the body to carry the scroll ID. 6.4 makes no such attempt and just puts it in the URL's query string, as sketched below.
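
To make the difference concrete, here is a rough sketch of the two behaviours described above; it is a paraphrase based on those descriptions, not the library's exact code:

    # Sketch of the two behaviours described above -- a paraphrase, not the
    # library's exact implementation.

    def scroll_6_3(client, scroll_id=None, body=None, params=None):
        # 6.3: when no explicit body is given, carry the scroll_id in the
        # request body; only fall back to the query string if a body exists.
        if scroll_id and not body:
            body = scroll_id
        elif scroll_id:
            params = dict(params or {}, scroll_id=scroll_id)
        return client.transport.perform_request(
            "GET", "/_search/scroll", params=params, body=body
        )

    def scroll_6_4(client, scroll_id=None, body=None, params=None):
        # 6.4: the scroll_id always ends up in the URL's query string, so a
        # long id exceeds http.max_initial_line_length.
        params = dict(params or {}, scroll_id=scroll_id)
        return client.transport.perform_request(
            "GET", "/_search/scroll", params=params, body=body
        )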

Specifically, it was the changes to the scroll method in this commit:
7ff0cd5?diff=unified#diff-2696ade2db2bc9506566dc21df9e00e8L1227

As discussed above, the root cause of this problem is Elasticsearch rejecting HTTP GET requests whose request line is longer than http.max_initial_line_length, a server-side configurable setting. Certain scroll operations can produce a large scroll_id that exceeds this value.

The good news is that this http.max_initial_line_length setting is exposed via the cluster API:

>>> from elasticsearch import Elasticsearch
>>> client = Elasticsearch()
>>> client.cluster.get_settings(include_defaults=True)['defaults']['http']['max_initial_line_length']
'4kb'
>>>

Would you consider a patch to the scroll method that did the following:

  1. Query cluster.get_settings for the current http.max_initial_line_length value. Perhaps we cache this result; I'll accept feedback there.
  2. If the scroll_id will cause us to exceed max_initial_line_length, switch to HTTP POST.
  3. Otherwise use GET.

This is obviously more complex than the current implementation, but it preserves the reverse-proxy and large-scroll use cases.
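
As a rough illustration, a sketch of that approach might look like the following; the helper names, the size-unit parser, the caching, and the headroom constant are all assumptions rather than an agreed design:

    # Sketch of the proposed adaptive GET/POST behaviour -- helper names, the
    # unit parser, caching and the headroom constant are assumptions.
    _cached_limit = None

    def _max_initial_line_length(client):
        """Fetch (and cache) http.max_initial_line_length in bytes, e.g. '4kb' -> 4096."""
        global _cached_limit
        if _cached_limit is None:
            raw = client.cluster.get_settings(include_defaults=True)[
                "defaults"]["http"]["max_initial_line_length"]
            for unit, factor in (("mb", 1024 ** 2), ("kb", 1024), ("b", 1)):
                if raw.endswith(unit):
                    _cached_limit = int(float(raw[: -len(unit)]) * factor)
                    break
        return _cached_limit

    def adaptive_scroll(client, scroll_id, scroll="2m"):
        # 1. Look up the server-side limit (cached after the first call).
        limit = _max_initial_line_length(client)
        # 2. If the scroll_id would push the request line past the limit,
        #    send it in the body of a POST instead.
        if len(scroll_id) + 200 > limit:  # ~200 bytes of headroom for the rest of the line
            return client.transport.perform_request(
                "POST", "/_search/scroll",
                params={"scroll": scroll}, body={"scroll_id": scroll_id},
            )
        # 3. Otherwise keep the reverse-proxy-friendly GET.
        return client.transport.perform_request(
            "GET", "/_search/scroll",
            params={"scroll": scroll, "scroll_id": scroll_id},
        )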

Once we achieve consensus, I'm happy to implement these changes in my PR.

Is there any reason not to just roll the method back to the 6.3 state?

Yeah, the only issue here is that the method was recently switched from passing scroll_id in the request body to passing it as a query parameter. It can remain a GET; it just needs to return to passing scroll_id in the request body.

I personally don't call scroll directly. I ran into this issue using the scan helper, which I patched in #973, but I think it might make more sense to return to the original scroll functionality rather than patch scan to get around the new functionality.
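
For context, this is roughly how the scan helper surfaces the error; the index name and query here are placeholders, not details from the original report:

    # Hedged example of hitting the problem through the scan helper -- the
    # index name and query are placeholders.
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import scan

    client = Elasticsearch()

    # scan() pages through results by calling client.scroll() under the hood,
    # so every page after the first carries the (potentially very long) scroll_id.
    for hit in scan(client, index="logs-*", query={"query": {"match_all": {}}}):
        print(hit["_id"])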

As first priority, I'll check if we can go back to the 6.3 functionality where scroll_id was passed in the request body. If not, let's take a look at the other options suggested.

It seems the fix (#973) is present on master, but there is no new release containing it yet?

@skbly7 correct, it's been merged to master and we're working on the release. Hopefully won't be too long!

Has #973 made its way into a release yet? Currently seeing this behavior via the scan helper in the 7.0.4 version.

@bbohen, yes #973 was released with the Client's 7.0.4 version.

Can confirm, I had this issue for 7.0.3 and upgrading to 7.0.5 (currently most recent release on pip) resolved this issue. Thank you @ChrisPortman for creating this issue, and @jtyoung and @Winterflower for fixing and merging it!

I believe this is closed in both 6.x and 7.x, if I'm mistaken please let me know or reopen a new issue. Thanks all! :)