spotify/heroic

Heroic appears to mishandle `EsRejectedExecutionException` errors from Elasticsearch in batched queries

Opened this issue · 2 comments

rf commented

I'm seeing an issue in production where an `EsRejectedExecutionException` from Elasticsearch causes Heroic to return a correct-looking response, but with empty results and empty errors. The responses look like this:

    "redactedkey:1": {
      "queryId": "21ee7a67-1023-495b-b2dd-55609d914f7a",
      "range": {
        "start": 1593316800000,
        "end": 1595736600000
      },
      "trace": {
        "what": {
          "name": "com.spotify.heroic.CoreQueryManager#query"
        },
        "elapsed": 537117,
        "children": [
          {
            "what": {
              "name": "com.spotify.heroic.CoreQueryManager#query_shard[{site=us-east1}]"
            },
            "elapsed": 537157,
            "children": [
              {
                "what": {
                  "name": "grpc://heroic-2.heroic.default.svc.cluster.local.:9698#query"
                },
                "elapsed": 537078,
                "children": [
                  {
                    "what": {
                      "name": "com.spotify.heroic.metric.LocalMetricManager#query"
                    },
                    "elapsed": 327049,
                    "children": []
                  }
                ]
              }
            ]
          }
        ]
      },
      "limits": [],
      "cached": false,
      "cache": null,
      "commonTags": {},
      "commonResource": {},
      "result": [],
      "preAggregationSampleSize": 0,
      "errors": []
    },

And in our Heroic logs:

    21:47:18.350 [elasticsearch[_client_][listener][T#3]] ERROR com.spotify.heroic.rpc.grpc.GrpcRpcProtocolServer - 29c3cfe3-ca6c-4c16-bc16-b755c8867848: Request failed
    Failed to execute phase [fetch],
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
            at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
            at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
            at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
            at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
            at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
            at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
            at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
            at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
            at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
            at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
            at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
            at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
            at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)
            at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
            at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
            at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
            at java.lang.Thread.run(Thread.java:748)
    Caused by: EsRejectedExecutionException[rejected execution of org.elasticsearch.action.search.FetchSearchPhase$1@4606797a on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@433876a3[Running, pool size = 4, active threads = 4, queued tasks = 1000, completed tasks = 3259109]]]
            at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50)
            at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
            at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
            at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.doExecute(EsThreadPoolExecutor.java:94)
            ... 48 more
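
For context on the rejection itself: the `Caused by` line shows the Elasticsearch `search` thread pool with all 4 threads active and its 1000-slot queue full, so `EsAbortPolicy` rejects the fetch task. The plain-JDK snippet below reproduces the same executor behaviour in isolation; it's only an illustration of the mechanism, not Elasticsearch or Heroic code.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.RejectedExecutionException;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class RejectionDemo {
        public static void main(String[] args) {
            // 4 threads, bounded queue of 1000 tasks, abort on overflow --
            // the same shape as EsThreadPoolExecutor[search, queue capacity = 1000, ...].
            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1000),
                new ThreadPoolExecutor.AbortPolicy());

            try {
                // Submit more work than the threads and queue can hold.
                for (int i = 0; i < 2000; i++) {
                    pool.execute(() -> sleep(100));
                }
            } catch (RejectedExecutionException e) {
                // This is the condition Heroic receives as EsRejectedExecutionException.
                System.out.println("rejected: " + e);
            } finally {
                pool.shutdownNow();
            }
        }

        private static void sleep(long ms) {
            try {
                Thread.sleep(ms);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }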

Heroic should pass this error along to the client in the `errors` array rather than returning an empty, apparently successful result.
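
Roughly the shape I'd expect, sketched against placeholder types (`ShardResult`, `ShardError`, and `SearchBackend` are stand-ins here, not Heroic's actual classes): the per-shard failure should become an entry in the errors list instead of being dropped.

    import java.util.Collections;
    import java.util.List;

    // Placeholder result model; Heroic's real result/error types differ.
    final class ShardError {
        final String shard;
        final String message;
        ShardError(String shard, String message) { this.shard = shard; this.message = message; }
    }

    final class ShardResult {
        final List<String> series;
        final List<ShardError> errors;
        ShardResult(List<String> series, List<ShardError> errors) {
            this.series = series;
            this.errors = errors;
        }
    }

    // Hypothetical backend call; EsRejectedExecutionException is unchecked,
    // so it surfaces here as a RuntimeException.
    interface SearchBackend {
        List<String> search(String shard);
    }

    final class ShardQuery {
        ShardResult query(SearchBackend backend, String shard) {
            try {
                return new ShardResult(backend.search(shard), Collections.emptyList());
            } catch (RuntimeException e) {
                // Don't swallow the failure into an empty result: report it so the
                // client's "errors" array is populated.
                return new ShardResult(
                    Collections.emptyList(),
                    Collections.singletonList(new ShardError(shard, e.getMessage())));
            }
        }

        public static void main(String[] args) {
            SearchBackend failing = shard -> {
                throw new RuntimeException("simulated rejected execution");
            };
            ShardResult result = new ShardQuery().query(failing, "us-east1");
            System.out.println("series: " + result.series + ", errors: " + result.errors.size());
        }
    }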

Thanks for reporting this! Just to confirm, are you running version 2.2.0 or an earlier one?

We are running 0.10.5 at the moment.