noxdafox/pebble

Pool blocked when child process is OOM killed

lodrion opened this issue · 7 comments

We use pebble pools to parallelize work that saturates both CPUs and memory. In most conditions the pool is able to recover from worker deaths, but under significant memory pressure we often see it lock up, unable to recover or even harvest processes that finished cleanly.

Running py-spy, I get the following dump:

100.00% 100.00%   12.00s    12.00s   _send (multiprocessing/connection.py:368)
100.00% 100.00%   12.00s    12.00s   _worker (concurrent/futures/thread.py:78)
  0.00% 100.00%   0.000s    12.00s   schedule (pebble/pool/process.py:217)
  0.00% 100.00%   0.000s    12.00s   send (multiprocessing/connection.py:206)
  0.00% 200.00%   0.000s    24.00s   _bootstrap_inner (threading.py:926)
  0.00% 200.00%   0.000s    24.00s   _bootstrap (threading.py:890)
  0.00% 100.00%   0.000s    12.00s   _send_bytes (multiprocessing/connection.py:404)
  0.00% 200.00%   0.000s    24.00s   run (sentry_sdk/integrations/threading.py:67)
  0.00% 100.00%   0.000s    12.00s   send (pebble/pool/channel.py:76)
  0.00% 100.00%   0.000s    12.00s   task_scheduler_loop (pebble/pool/process.py:167)
  0.00% 100.00%   0.000s    12.00s   dispatch (pebble/pool/process.py:347)
  0.00% 200.00%   0.000s    24.00s   run (threading.py:870)

(so 100% of the CPU time is spent trying to send a task to a process that I suspect is no longer there)

The stack traces seem to confirm this

Thread 0x7F6CDFD3E700 (idle): "MainThread"
    wait (threading.py:296)
    wait (threading.py:552)
    wait (concurrent/futures/_base.py:301)
    _wait_futures (app/util/prun.py:165)
    _execute_with_retries (app/util/prun.py:99)
    pexecute (app/util/prun.py:60)
    merged_people_parallel (app/merging/people_driver.py:393)
    drive_people_merge_diff_apply (app/merging/people_driver.py:574)
    invoke (dmb_logging/slack_util.py:154)
    _subprocess_run_and_slack_log (app/util/parallel_processing.py:476)
    run (multiprocessing/process.py:99)
    _bootstrap (multiprocessing/process.py:297)
    _launch (multiprocessing/popen_fork.py:74)
    __init__ (multiprocessing/popen_fork.py:20)
    _Popen (multiprocessing/context.py:277)
    _Popen (multiprocessing/context.py:223)
    start (multiprocessing/process.py:112)
    run_forked_multiprocessing_processes (app/util/parallel_processing.py:523)
    run_forked_multiprocessing_process (app/util/parallel_processing.py:581)
    _run_stage (app/cli/merging_pipeline_run.py:224)
    _run_stages_and_update_pipeline_status (app/cli/merging_pipeline_run.py:243)
    run_pipeline (app/cli/merging_pipeline_run.py:335)
    _nightly_run (app/cli/nightly_run_script.py:72)
    main (app/cli/nightly_run_script.py:165)
    invoke (click/core.py:610)
    invoke (click/core.py:1066)
    main (click/core.py:782)
    __call__ (click/core.py:829)
    <module> (app/cli/nightly_run_script.py:176)
    _run_code (runpy.py:85)
    _run_module_as_main (runpy.py:193)
Thread 0x7F6CCBFEC700 (active): "ThreadPoolExecutor-0_4"
    _worker (concurrent/futures/thread.py:78)
    run (threading.py:870)
    run (sentry_sdk/integrations/threading.py:67)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
Thread 0x7F679D4FF700 (active): "Thread-4"
    _send (multiprocessing/connection.py:368)
    _send_bytes (multiprocessing/connection.py:404)
    send (multiprocessing/connection.py:206)
    send (pebble/pool/channel.py:76)
    dispatch (pebble/pool/process.py:347)
    schedule (pebble/pool/process.py:217)
    task_scheduler_loop (pebble/pool/process.py:167)
    run (threading.py:870)
    run (sentry_sdk/integrations/threading.py:67)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)

So indeed Thread-4 is stuck trying to send. We have timeouts set, but they don't seem to affect this particular blockage.
Is there a way to put a timeout on the underlying thread?

I think at this point I have confirmed what is happening: worker processes get killed by the OOM killer with SIGKILL, after which the pipe is left completely hung.

Hello,

There is nothing we can do to handle this situation from the application point of view.

A SIGKILL signal forces a process to exit immediately without any cleanup, regardless of the state the process is in. This means that if the process is using a shared resource, that resource might become irrecoverably corrupted. In such cases it is to be expected that any API call against such a resource may yield undefined behaviour.

Pebble uses SIGKILL to stop unresponsive processes after SIGTERM has been ignored. Nevertheless, it protects the pipe before issuing the signal, avoiding any further issues.

The same cannot be done when external entities, such as the OS, kill the process. In the case of OOM on Linux specifically, it's always better to restart the whole machine, as the OOM killer does not provide many guarantees about which processes it will kill. Hence, you cannot guarantee your OS is fully functional after an OOM killer sweep. Core processes like cron might have been caught in its barrage.

The best solution is to prevent your worker processes from swallowing all the OS memory before the kernel has to step in. If your application is written in pure Python, you can achieve this as per the documentation. If you are relying on lower-level C APIs, then ulimit is the answer.
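
For the pure-Python case, here is a minimal sketch of such a cap using the standard-library `resource` module and the pool's `initializer`/`initargs` hook (assuming a pebble version that accepts them). The 2 GiB limit and the `process_chunk` function are illustrative placeholders, not part of pebble:

```python
import resource

from pebble import ProcessPool

WORKER_MEMORY_LIMIT = 2 * 1024 ** 3  # 2 GiB per worker; tune to your workload


def limit_worker_memory(limit):
    # Cap the worker's virtual address space: allocations beyond the limit
    # raise MemoryError inside the worker instead of inviting the OOM killer.
    resource.setrlimit(resource.RLIMIT_AS, (limit, limit))


def process_chunk(chunk):
    # Placeholder for the real memory-hungry work.
    return sum(chunk)


if __name__ == "__main__":
    with ProcessPool(max_workers=4,
                     initializer=limit_worker_memory,
                     initargs=(WORKER_MEMORY_LIMIT,)) as pool:
        future = pool.schedule(process_chunk, args=([1, 2, 3],), timeout=600)
        print(future.result())
```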

Our use case is a master spinning up a bunch of workers to do CPU- (and memory-) intensive processing. Given that these are guaranteed to be the most memory-hungry processes on the system (and given the OOM killer's preference for killing children rather than the parent), we have a good expectation that the workers will be killed first.

There are two reasons why we'd like to avoid force-setting memory limits. One is that the master loads huge data structures into memory before forking off children so that they can use the shared data, which makes limit computation very flaky, especially in a Docker context. The other is that OOM kills are relatively rare, and since we have retries, it's faster to retry failed processes than to limit the processing size to the point where we are guaranteed that no process will be killed.

So the ideal solution from my perspective would be to protect the pipes and/or make reads/writes non-blocking, so that the pool would just report the task as failed and move on.

As I said, there is little we can do at the application level.

The send() gets stuck because the pipe fills up as no one is draining it. The process which was reading from it was removed abruptly, and all the other workers are waiting for it to finish reading. To make things worse, the process owning the pipe Lock is gone and we cannot unlock it anymore.

The pipe could not be drained even if we tried to, because we lost the information about how long the truncated message was (as documented here). Hence, we don't know how much data to read. Even with a non-blocking call, we would just get an error every time we interact with it.

If you cannot avoid your machine going OOM, the only way forward is to terminate the current Pool so you can trash the corrupted pipe and create a new one. This is how we handled similar cases in the past: if all the enqueued futures timed out, it was a clear sign the Pool's pipe was busted.
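
As a caller-side illustration of that detection heuristic (this is application bookkeeping, not pebble API): schedule each task with a timeout and, if every future in a batch raises TimeoutError, treat the pool as stuck rather than merely slow. The `run_task` function and timeout value below are placeholders:

```python
from concurrent.futures import TimeoutError

from pebble import ProcessPool


def run_task(item):
    # Placeholder for the real work.
    return item * 2


def run_batch(pool, items, task_timeout=300):
    """Return completed results plus a flag saying whether the pool looks busted."""
    futures = [pool.schedule(run_task, args=(item,), timeout=task_timeout)
               for item in items]
    results, timeouts = [], 0
    for future in futures:
        try:
            results.append(future.result())
        except TimeoutError:
            timeouts += 1   # the task never came back at all
        except Exception:
            pass            # the task ran but raised, so the pool itself is alive
    # If every single future timed out, assume the pipe is busted rather than
    # that the workload suddenly became uniformly slow.
    return results, bool(futures) and timeouts == len(futures)
```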

We tried in the past to solve this issue in many different ways and the outcome was always the same: performance would degrade significantly and/or the problem would still manifest itself in some other way.

I wonder if just setting the pipe to non-blocking would do the trick? When you get an error, just blowing away the pipe and the underlying process (if any) would probably make the most sense.

I realize that this is probably not the best solution for all use cases, but it is for ours... So is there any way to add this as an optional setting?

It's not that simple; the logic would need to replace most of the internals:

  • Corrupted Pipe
  • Disowned Lock
  • Internal threads
  • All the worker processes

And that is without considering what to do with the pending futures, whose original data (function and parameters) has been lost.

It's basically simpler to re-create the Pool from scratch than to safely replace all its internals.

Closing this issue.

As I said above, there is no easy way to deal with a SIGKILL issued by external entities (whether the OS or another process).

There are two approaches:

  1. Prevent the node from reaching an OOM state by limiting the amount of memory each single worker can consume. It is possible to overprovision memory while still preventing the box from going OOM, as the OOM itself is usually caused by a single particularly problematic task rather than by many high-memory tasks together.
  2. Re-create the pool when the rate of completed tasks grinds to a halt. For that, you can use concurrent.futures.wait with a reasonably high timeout; if none of the listed tasks complete before the timeout, you can call ProcessPool.stop and ProcessPool.join. Then you can re-schedule into the new pool all the tasks which were not completed (see the sketch after this list).
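
A condensed sketch of approach 2, assuming the caller keeps its own mapping from each future to the original (function, args) so the work can be re-submitted; the pool size, the STALL_TIMEOUT value and the `make_pool`/`run_all` names are illustrative:

```python
import concurrent.futures

from pebble import ProcessPool

STALL_TIMEOUT = 900  # seconds without a single completion => assume the pipe is busted


def make_pool():
    return ProcessPool(max_workers=8)


def run_all(tasks):
    """tasks: list of (function, args) tuples; returns the collected results."""
    pool = make_pool()
    # Keep (function, args) next to each future ourselves: once the pool's pipe
    # is gone, the original call data cannot be recovered from the future.
    pending = {pool.schedule(fn, args=args): (fn, args) for fn, args in tasks}
    results = []
    while pending:
        done, _ = concurrent.futures.wait(
            pending,
            timeout=STALL_TIMEOUT,
            return_when=concurrent.futures.FIRST_COMPLETED)
        if not done:
            # Nothing completed within the window: trash the pool (and with it
            # the corrupted pipe), then reschedule every outstanding task.
            pool.stop()
            pool.join()
            pool = make_pool()
            pending = {pool.schedule(fn, args=args): (fn, args)
                       for fn, args in pending.values()}
            continue
        for future in done:
            pending.pop(future)
            try:
                results.append(future.result())
            except Exception:
                pass  # the task ran and failed; retry handling would go here
    pool.stop()
    pool.join()
    return results
```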

The only design which would actually allow handling such situations while still guaranteeing continuity of operation would consist of allocating a dedicated pipe per worker. If one of the pipes gets corrupted because its worker is killed abruptly, that single worker pipe can be replaced. This design was tried many years ago and it resulted in a very slow implementation due to the overhead of handling several I/O pipes concurrently. At that point the concept was abandoned, as the whole library was starting to look more like a solution such as celery or luigi, which were already quite mature at the time. You might want to evaluate those two libraries instead of Pebble if the suggested approaches are not feasible.