elastic/elasticsearch

Document maximum size for bulk indexing over HTTP

Closed this issue · 7 comments

There seems to be a maximum size for bulk indexing over HTTP, somewhere around 100k records. It does not seem to be documented.

Never mind, found it: the maximum HTTP request size is controlled by the http.max_content_length setting.
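For reference, this is a node-level setting in elasticsearch.yml (the default is 100mb; the value below is only an illustration):

    http.max_content_length: 500mb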

I've increased the setting, but I still get the error from Netty: HTTP content length exceeded 104857600 bytes.
The log explains why: maxContentLength[2.9gb] set to high value, resetting it to [100mb]
Netty expects an integer and checks the value against Integer.MAX_VALUE, or 2^31 - 1.

So, basically, 2gb is the maximum size. ES does not process an HTTP request until it has been received in full, so for those working with large files it makes quite a bit more sense to use a streaming API or to run a river on the server. In one example of 2m records at about 2gb uncompressed, it took longer to upload through Netty than to index, since indexing does not start until the HTTP request has completed (which is reasonable).

Yea, it represents the whole request in memory (that's how async IO works, and it also needs to break the bulk request up per shard). You should make sure to break down your indexing into smaller bulk sizes.
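A minimal sketch of that kind of client-side batching, assuming Python with the requests library, a node at http://localhost:9200, an index named "myindex", and a batch size of 1000 (all of these are illustrative, not anything prescribed in this thread):

    # Send documents to _bulk in fixed-size batches instead of one huge request.
    import json
    import requests

    ES_URL = "http://localhost:9200/_bulk"
    BATCH_SIZE = 1000  # tune by payload size as well, not just document count

    def bulk_batches(docs, batch_size=BATCH_SIZE):
        # Yield NDJSON bulk bodies of at most batch_size documents each.
        lines = []
        for i, doc in enumerate(docs, 1):
            # Older versions also expect a "_type" in the action line.
            lines.append(json.dumps({"index": {"_index": "myindex"}}))
            lines.append(json.dumps(doc))
            if i % batch_size == 0:
                yield "\n".join(lines) + "\n"
                lines = []
        if lines:
            yield "\n".join(lines) + "\n"

    def index_all(docs):
        for body in bulk_batches(docs):
            resp = requests.post(
                ES_URL,
                data=body.encode("utf-8"),
                headers={"Content-Type": "application/x-ndjson"},
            )
            resp.raise_for_status()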

From the Netty docs at http://docs.jboss.org/netty/3.2/api/org/jboss/netty/handler/codec/http/HttpMessageDecoder.html:

maxInitialLineLength: The maximum length of the initial line (e.g. "GET / HTTP/1.0" or "HTTP/1.0 200 OK"). If the length of the initial line exceeds this value, a TooLongFrameException will be raised.
maxHeaderSize: The maximum length of all headers. If the sum of the lengths of the headers exceeds this value, a TooLongFrameException will be raised.
maxChunkSize: The maximum length of the content or each chunk. If the content length (or the length of each chunk) exceeds this value, the content or chunk will be split into multiple HttpChunks whose length is maxChunkSize at maximum.

As you can see, just like Tomcat, Netty supports chunking… so async IO may represent the whole request in memory, but not at the Netty level, eh? There was a famous bug around chunking in early Tomcat 6.

And since a request can also be sent in chunks, there is no system-level async IO that knows whether the entire request is complete; it only knows whether all TCP segments have been received and are correct.

will


HTTP chunking, or any chunking in general, does not solve the "boundaries" problem when it comes to bulk requests. I'm not saying it's impossible to support chunking with bulk (find the boundaries, send those chunks to the relevant shards, keep doing that, and send the responses back in chunks), but it's quite complicated. What you need to do for now is simply break the bulk requests into chunks yourself.
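For anyone splitting on the client side, the important part is to cut between action/source pairs rather than at arbitrary byte offsets, so no request ends in the middle of a document. A hedged sketch, assuming an index-only bulk file in which every document occupies exactly two NDJSON lines (the action line plus the source line); delete actions, which are a single line, would need extra handling:

    # Split an existing bulk NDJSON file into bodies under a byte budget,
    # cutting only between (action, source) pairs.
    def split_bulk_file(path, max_bytes=8 * 1024 * 1024):
        with open(path, "rb") as f:
            lines = [ln for ln in f.read().splitlines() if ln.strip()]
        chunk, size = [], 0
        for i in range(0, len(lines) - 1, 2):  # step over (action, source) pairs
            pair = lines[i] + b"\n" + lines[i + 1] + b"\n"
            if chunk and size + len(pair) > max_bytes:
                yield b"".join(chunk)
                chunk, size = [], 0
            chunk.append(pair)
            size += len(pair)
        if chunk:
            yield b"".join(chunk)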

I have the same issue. We are indexing and saving documents using the _bulk endpoint. We knew that the maximum threshold for the HTTP request payload is 10MB, so we chunk our bulk saves at up to 8MB, but we still get the same exception. Below is a sample of the exception from our code:

[POST] URL [https:///_bulk]
RESPONSE_BODY [{"Message":"Request size exceeded 10485760 bytes"}]
REQUEST PARAM SIZE [8298283 Bytes]
REQUEST_PARAM [{"index":{"_index":"indexname here","_type":"item","_id":"https://..."}}]

I removed other information (company details), but as you can see it throws an exception saying we exceeded the 10MB threshold, even though the request payload we sent is only around 8MB.
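One thing that may be worth ruling out (an assumption on my part, not a diagnosis) is a character-count versus byte-count mismatch: the limit applies to the encoded request body, and multi-byte UTF-8 characters make the byte count larger than the character count. A quick check before sending, where build_bulk_body() stands in for whatever produces your NDJSON payload:

    # Confirm the encoded size of the bulk body that actually goes on the wire.
    # build_bulk_body() is a hypothetical helper returning the NDJSON payload as a str.
    body = build_bulk_body(docs)
    print("characters:", len(body))
    print("bytes on the wire:", len(body.encode("utf-8")))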

By the way, we are using AWS-hosted Elasticsearch. Hoping to hear from the experts here. Thanks.

Ador

@donzkie the same for me.
curl -H 'Content-Type: application/x-ndjson' -XPOST 'https://XXX.es.amazonaws.com/shakespeare/doc/_bulk?pretty' --data-binary @shakespeare_6.0.json
{"Message":"Request size exceeded 10485760 bytes"}