quickwit-oss/quickwit

Searcher gets disconnected on query

Closed · 1 comment

Observed on Airmail...

Some searchers do not answer their health probes on time and get killed by k8s.

The current theory is that we have been using spawn_blocking a little bit too liberally in the code...
We use it in many places to avoid CPU load on the runtime threads.

The trouble with this is that the associated thread pool is effectively unbounded, because it is designed to deal with blocking IO (Tokio's default limit is 512 blocking threads).
500 threads trying to decompress docstore blocks plus 1 runtime thread are likely to stall the runtime thread for a long time.
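
One way to picture the alternative is a pool whose size is fixed up front, so CPU-bound jobs queue up instead of each getting its own thread. The following is only a minimal, std-only sketch under that assumption: `CpuPool` and its methods are hypothetical names for illustration, not Quickwit's code and not a Tokio API.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical fixed-size pool for CPU-bound jobs (illustration only).
pub struct CpuPool {
    sender: Option<mpsc::Sender<Job>>,
    workers: Vec<thread::JoinHandle<()>>,
}

impl CpuPool {
    /// Starts exactly `num_threads` workers; extra jobs wait in the queue.
    pub fn new(num_threads: usize) -> CpuPool {
        assert!(num_threads > 0, "the pool needs at least one thread");
        let (sender, receiver) = mpsc::channel::<Job>();
        let receiver = Arc::new(Mutex::new(receiver));
        let workers = (0..num_threads)
            .map(|_| {
                let receiver = Arc::clone(&receiver);
                thread::spawn(move || loop {
                    // The lock is released at the end of this statement,
                    // so it is not held while the job runs.
                    let job = receiver.lock().unwrap().recv();
                    match job {
                        Ok(job) => job(),
                        Err(_) => break, // sender dropped: shut down
                    }
                })
            })
            .collect();
        CpuPool { sender: Some(sender), workers }
    }

    /// Queues `task` and returns a receiver that yields its result.
    pub fn spawn<T, F>(&self, task: F) -> mpsc::Receiver<T>
    where
        F: FnOnce() -> T + Send + 'static,
        T: Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        let job: Job = Box::new(move || {
            // The caller may have dropped the receiver; ignore that case.
            let _ = tx.send(task());
        });
        self.sender
            .as_ref()
            .expect("pool is running")
            .send(job)
            .expect("workers are alive");
        rx
    }
}

impl Drop for CpuPool {
    fn drop(&mut self) {
        // Closing the channel lets every worker exit its loop.
        drop(self.sender.take());
        for worker in self.workers.drain(..) {
            let _ = worker.join();
        }
    }
}

fn main() {
    // Two workers: at most two jobs burn CPU at any moment.
    let pool = CpuPool::new(2);
    let pending: Vec<_> = (0..8u64)
        .map(|i| pool.spawn(move || (0..1_000_000u64).fold(i, |acc, x| acc ^ x)))
        .collect();
    for rx in pending {
        println!("{}", rx.recv().unwrap());
    }
}
```

With a pool sized to roughly the number of cores, a burst of 500 decompression jobs would wait in the queue rather than compete with the runtime thread for CPU time.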

We can test this theory and, more generally, it is worth auditing all of our uses of spawn_blocking.

Other possibilities would be something holding the runtime threads, or some variant of the LIFO slot bug, but the latter is very unlikely.

So far, we think the source of the issue is this line. It is a CPU-bound task (decompressing around 1MB of zstd data) which, in the context of Airmail, can be called up to 150 times before yielding. This means we can likely keep a worker busy for upwards of 500ms. It isn't noticeable with no query parallelism (other workers are still answering), but serving a couple of queries at once can easily make us unresponsive.
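
The arithmetic behind that estimate can be sketched as follows. The per-call cost of about 3.3ms is an assumption, derived only by dividing the 500ms figure by the 150 calls; it was not measured. The 10ms budget and both function names are likewise hypothetical.

```rust
use std::time::Duration;

/// Longest time a worker is held when `calls_before_yield` CPU-bound
/// calls run back to back without yielding.
fn worst_case_stall(per_call: Duration, calls_before_yield: u32) -> Duration {
    per_call * calls_before_yield
}

/// Largest number of back-to-back calls that keeps the stall within `budget`.
fn max_calls_within(per_call: Duration, budget: Duration) -> u32 {
    if per_call.is_zero() {
        return u32::MAX; // free calls never exceed the budget
    }
    let calls = budget.as_nanos() / per_call.as_nanos();
    u32::try_from(calls).unwrap_or(u32::MAX)
}

fn main() {
    // Assumed cost of one ~1MB zstd decompression (500ms / 150 calls).
    let per_call = Duration::from_micros(3_300);

    // 150 calls before yielding, as reported for Airmail.
    println!("stall: {:?}", worst_case_stall(per_call, 150));

    // Hypothetical target: never hold a worker for more than 10ms.
    let budget = Duration::from_millis(10);
    println!("calls per batch: {}", max_calls_within(per_call, budget));
}
```

Under these assumed numbers, yielding every few calls rather than every 150 would keep each stall in the low milliseconds, well within a typical health probe timeout.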