Druid requests go from <100ms latency to 5, 10, or 15 seconds of latency a few minutes after startup. Possible connection limit issue?

Question

tshallenberger opened this issue 6 months ago · 1 comments

We run Turnilo in a Docker container, and it uses Plywood to talk to a Druid cluster. When we enable verbose logging, we see requests go from

Requester rq06461 got result from query 269: (in 56ms)
[
  {
    "maxTime": "2024-03-13T23:55:00.000Z"
  }
]

to

Requester rq06461 got result from query 352: (in 10019ms)
[
  {
    "maxTime": "2024-03-14T15:25:00.000Z"
  }
]
TimeMonitor Got the latest time for 'REDACTED' (2024-03-14T15:25:00.000Z)
vvvvvvvvvvvvvvvvvvvvvvvvvv
Requester rq06461 got result from query 350: (in 15283ms)
[
  {
    "maxTime": "2024-03-14T15:25:00.000Z",
    "minTime": "2024-03-12T15:00:00.000Z",
    "timestamp": "2024-03-12T15:00:00.000Z"
  }
]

approximately 2-4 minutes after the container starts. I'm trying to track down the source of the issue and figured I'd raise it here to see if anyone has any input. I'm not sure whether this is a Docker (podman), Turnilo, or Plywood problem. Is there a way to enable more detailed logging to see whether Plywood is the cause?
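
To pin down when the slowdown starts, it may help to pull the timings out of the verbose log rather than eyeballing them. Below is a minimal sketch that assumes the log lines look exactly like the "Requester ... got result from query N: (in Xms)" lines quoted above. The function names and the 1000 ms threshold are illustrative choices, not part of Turnilo or Plywood.

```python
import re

# Matches lines such as:
#   Requester rq06461 got result from query 269: (in 56ms)
# The pattern is inferred from the log excerpt in this thread (an assumption).
LINE_RE = re.compile(r"Requester (\S+) got result from query (\d+): \(in (\d+)ms\)")


def parse_latencies(log_text):
    """Return (requester, query_id, millis) tuples for every matching log line."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in LINE_RE.finditer(log_text)
    ]


def slow_queries(log_text, threshold_ms=1000):
    """Keep only the queries that took at least threshold_ms (hypothetical cutoff)."""
    return [row for row in parse_latencies(log_text) if row[2] >= threshold_ms]


# Sample input copied from the log lines in the question.
sample = """\
Requester rq06461 got result from query 269: (in 56ms)
Requester rq06461 got result from query 352: (in 10019ms)
Requester rq06461 got result from query 350: (in 15283ms)
"""

for requester, query_id, millis in slow_queries(sample):
    print(f"query {query_id} via {requester}: {millis}ms")
```

Running this over the full container log, for example captured with `podman logs`, would show which query number is the first to cross the threshold, and that can be lined up against the container start time.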

Answer 1 · 2024-04-05T14:58:12.000Z

This turned out to be a problem with how the containers were deployed: they were started from podman-kube play files and run under systemd on RHEL8. Upgrading the generated kube file seemed to fix the issue. The only difference between the two config files appeared to be that the new one drops some security context capabilities, as the containers were running in rootless mode on an SELinux-enabled machine.