uber/neuropod

Recovery from: "libc++abi.dylib: terminating with uncaught exception of type std::runtime_error"

vkuzmin-uber opened this issue · 7 comments

Bug

I am testing inference with 4 OPE instances. I am using the C++ library from a CGo service.

Under high load I observed the main process terminating:

"libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Neuropod Error: Timed out waiting for a response from worker process. Didn't receive a message in 5000ms, but expected a heartbeat every 2000ms."

The caller has a try/catch around the inference call:

    try {
        auto output = neuropod->infer(valueMap);
    } catch (const std::exception& e) {
        ...
    }

and I know that it catches the exception in the IPE case (TF errors are received). But it seems this is not the case for OPE.

Regardless of what caused it, there are 2 questions:

  1. How can I catch this exception?
  2. If the worker process died for some reason (or something else went wrong), can I release and load the neuropod again to recover? (See the sketch below.)
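
To make question 2 concrete, here is a hedged sketch of what we would like to do on the caller side once the exception is actually catchable. Only the Neuropod constructor and infer() are existing API; make_model() and the reload-then-retry-once policy are illustrative.

    #include <exception>
    #include <memory>
    #include <string>

    #include "neuropod/neuropod.hh"

    // Hedged sketch of question 2: if infer() throws, drop the handle and load
    // the neuropod again. Whether a fresh load is safe after a worker death is
    // exactly what this issue asks; make_model() and the single retry are
    // illustrative only.
    std::unique_ptr<neuropod::Neuropod> make_model(const std::string &model_path)
    {
        // In our service this is constructed with OPE enabled
        return std::make_unique<neuropod::Neuropod>(model_path);
    }

    std::unique_ptr<neuropod::NeuropodValueMap> infer_with_reload(
        std::unique_ptr<neuropod::Neuropod> &model,
        const neuropod::NeuropodValueMap &valueMap,
        const std::string &model_path)
    {
        try
        {
            return model->infer(valueMap);
        }
        catch (const std::exception &e)
        {
            // Release the (possibly broken) instance, load the neuropod again,
            // then retry the inference once
            model.reset();
            model = make_model(model_path);
            return model->infer(valueMap);
        }
    }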

To Reproduce

Steps to reproduce the behavior:

Put the system under high load, or otherwise cause the heartbeat not to arrive in time.
UPDATE:
The worker process died. Kill the worker process to reproduce.

Expected behavior

The exception should be catchable by the calling process.

Environment

  • Neuropod Version (e.g., 0.2.0): 0.2.0
  • OS (Linux, macOS):
  • Language (Python, C++, Go bindings): CGo, with C bridge.
  • Python version:
  • Using OPE: yes

Additional context

I suspect that the exception in OPE is not propagated between threads:

C++ supports propagating exceptions between threads via:

    std::current_exception();
    std::rethrow_exception(teptr);

I grepped neuropod and didn't see these calls.
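
For reference, here is a minimal standalone sketch (not neuropod code) of what propagating an exception from a background thread to the caller with these primitives looks like:

    #include <exception>
    #include <iostream>
    #include <stdexcept>
    #include <thread>

    int main()
    {
        std::exception_ptr teptr = nullptr;

        std::thread worker([&teptr]() {
            try
            {
                throw std::runtime_error("failure on worker thread");
            }
            catch (...)
            {
                // Capture the in-flight exception so another thread can rethrow it
                teptr = std::current_exception();
            }
        });
        worker.join();

        try
        {
            if (teptr)
            {
                std::rethrow_exception(teptr);
            }
        }
        catch (const std::exception &e)
        {
            // The calling thread sees the worker thread's exception here
            std::cerr << e.what() << std::endl;
        }
        return 0;
    }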

UPDATE:
If the parent process detects a timeout, it reports an error and exits; it treats this as a critical, non-recoverable problem. However, in production the worker can die because of OOM or bugs/leaks in the platform. The parent process should be able to stay running and handle this.

If it needs to be confirmed that this is a problem with propagating exceptions between threads, I can debug it. I just thought someone might already be aware of it.

I think I see the code that causes it:

read_worker_(&IPCMessageQueue<UserPayloadType>::read_worker_loop, this)

  void IPCMessageQueue<UserPayloadType>::read_worker_loop()
...
          bool         successful_read =
              recv_queue_->timed_receive(received.get(), sizeof(WireFormat), received_size, priority, timeout_at);

          if (!successful_read)
          {
              // We timed out
              NEUROPOD_ERROR("Timed out waiting for a response from worker process. "
                             "Didn't receive a message in {}ms, but expected a heartbeat every {}ms.",
                             detail::MESSAGE_TIMEOUT_MS,
                             detail::HEARTBEAT_INTERVAL_MS);
          }

As a result, the exception is thrown on the read_worker_ thread. I think that instead it should put an EXCEPTION message into

              // This is a user-handled message
              out_queue_.emplace(std::move(received));

so that the caller thread detects it and throws. Let me know if this is correct, and I can put together a PR and re-test it.
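
To illustrate the idea with a simplified stand-in for out_queue_ (this is not neuropod's actual queue or WireFormat type, just a sketch of the behavior I'm proposing):

    #include <condition_variable>
    #include <exception>
    #include <mutex>
    #include <queue>
    #include <stdexcept>
    #include <utility>
    #include <variant>

    struct Payload
    {
        // stands in for the real inference result / wire format
    };

    using Item = std::variant<Payload, std::exception_ptr>;

    class OutQueue
    {
    public:
        // Called from the read_worker_ thread
        void push(Item item)
        {
            {
                std::lock_guard<std::mutex> lock(mutex_);
                items_.push(std::move(item));
            }
            cv_.notify_one();
        }

        // Called from the caller (inference) thread
        Payload pop()
        {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return !items_.empty(); });
            Item item = std::move(items_.front());
            items_.pop();
            if (auto *err = std::get_if<std::exception_ptr>(&item))
            {
                // Surfaces the error inside the caller's try/catch around infer()
                std::rethrow_exception(*err);
            }
            return std::get<Payload>(std::move(item));
        }

    private:
        std::mutex              mutex_;
        std::condition_variable cv_;
        std::queue<Item>        items_;
    };

    // On the read_worker_ thread, a timeout would then become something like:
    //     out_queue.push(std::make_exception_ptr(std::runtime_error(
    //         "Timed out waiting for a response from worker process")));

With something like this, a timeout would surface inside the caller's existing try/catch around infer() instead of terminating the process.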

So when running with OPE, we do indeed propagate exceptions from the worker process:

    catch (const std::exception &e)
    {
        // Send the exception info back to the main process
        std::string msg = e.what();
        control_channel.send_message(EXCEPTION, msg);
    }
    catch (...)
    {
        control_channel.send_message(EXCEPTION, "An unknown exception occurred during inference");
    }

Timeouts are handled slightly differently, as you've noticed:

    if (!successful_read)
    {
        // We timed out
        NEUROPOD_ERROR("Timed out waiting for a response from worker process. "
                       "Didn't receive a message in {}ms, but expected a heartbeat every {}ms.",
                       detail::MESSAGE_TIMEOUT_MS,
                       detail::HEARTBEAT_INTERVAL_MS);
    }

Note that this gets thrown if we don't receive any message within the timeout, not just heartbeats. It usually happens when the worker process segfaults or crashes in a way that doesn't trigger the try/catch above.

If we don't handle timeouts correctly in the message reading thread, it could lead to a deadlock when sending new messages. For example, if the main process is trying to send a message to the worker process and the queue is full, it'll block until there's a spot freed up. However, no progress will be made if the worker process isn't alive and the main thread will block forever. There are solutions to this that don't involve throwing an exception on another thread, but unfortunately it isn't as straightforward as just treating a timeout as another exception.

I think we may be able to modify the message sending logic to handle the deadlock case when queues are full. This should let us remove the NEUROPOD_ERROR on the message reading thread while also not impacting performance.
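
As a rough sketch of that direction, assuming the underlying queues are boost::interprocess message queues (as the timed_receive call above suggests); the worker_alive flag and the standalone WireFormat here are placeholders rather than actual neuropod members:

    #include <atomic>
    #include <stdexcept>

    #include <boost/date_time/posix_time/posix_time.hpp>
    #include <boost/interprocess/ipc/message_queue.hpp>

    // Placeholder for the real wire format
    struct WireFormat
    {
        char data[1024];
    };

    // Retry a bounded timed_send() instead of a blocking send(), and give up if
    // the worker is known to be dead (worker_alive is a hypothetical flag that a
    // watchdog or the reading thread would maintain).
    void send_with_liveness_check(boost::interprocess::message_queue &queue,
                                  const WireFormat &msg,
                                  const std::atomic<bool> &worker_alive)
    {
        while (true)
        {
            const auto deadline = boost::posix_time::microsec_clock::universal_time() +
                                  boost::posix_time::milliseconds(500);

            // Returns false if the queue stayed full until the deadline
            if (queue.timed_send(&msg, sizeof(msg), /* priority */ 0, deadline))
            {
                return;
            }

            if (!worker_alive.load())
            {
                // Don't block forever on a queue nobody will ever drain
                throw std::runtime_error("Worker process died; aborting send");
            }
        }
    }

The key point is that a full queue only blocks the sender for a bounded time, so a dead worker turns into an error on the sending thread rather than an exception thrown on the reading thread.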

Can you consistently reproduce the timeout? As I mentioned above, it's not necessarily that there's so much load that the heartbeat can't be sent in time. It happens when no message has been received in 5000ms.

I found a way to reproduce it. Interestingly, it is related to how the client sends requests: under high load (20K messages), with 4 OPE instances and 4 concurrent client threads, I can reproduce it on my machine. But with 4 OPE instances and 1, 2, or 8 concurrent client threads there is no timeout. I don't understand the reason yet, and the process termination doesn't let me see whether neuropod can still serve or whether a deadlock happened.

I can try patching my local copy of neuropod and see whether it can perform the next inference if the process is not terminated.

I found that this happens because the master process gets a SEGV first, and then the worker throws an exception because of the timeout.

#397

@VivekPanyam

Last time we found a bug in neuropod and this one wasn't addressed. This is becoming more critical for us since we are moving from a containerized solution to a service with multiple models in OPE mode.

We experienced cases where the OPE worker died because of:

  • Incompatible backend: we tried to load a TorchScript 1.7 model on a TorchScript 1.1 backend. This was related to a rollback to an old version that still had the old backend.
  • OOM killer: a containerized app with a memory quota. If the worker process hits the memory limit (huge model under high load), the OOM killer kills the worker process, and the service then crashes because of this issue.

In both cases, the service could make a smart decision about whether to stay running. It makes sense to allow unloading a model that was loaded successfully once. Neuropod core could close its IPC objects, and if the worker isn't actually dead, it would wake up, time out, release its resources, and exit too. We may even consider allowing the core to send a KILL signal to the worker.

What do you think? Let us know if you need help with the fix.
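
For the liveness-check / KILL part of that proposal, here is a sketch using plain POSIX calls (worker_pid would come from wherever the OPE worker process is spawned; none of this is existing neuropod API):

    #include <csignal>
    #include <sys/types.h>
    #include <sys/wait.h>

    // Returns true while the worker child process is still running.
    bool worker_is_alive(pid_t worker_pid)
    {
        int status = 0;
        // WNOHANG: don't block; 0 means the child is still running,
        // the pid means it exited (and is now reaped), -1 means error
        return waitpid(worker_pid, &status, WNOHANG) == 0;
    }

    // Force-kill a worker that is stuck or leaking, then reap it.
    void force_kill_worker(pid_t worker_pid)
    {
        if (worker_is_alive(worker_pid))
        {
            kill(worker_pid, SIGKILL);
            waitpid(worker_pid, nullptr, 0);  // avoid leaving a zombie
        }
    }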

> Last time we found a bug in neuropod and this one wasn't addressed. This is becoming more critical for us since we are moving from a containerized solution to a service with multiple models in OPE mode.

Based on your previous comment, we transitioned focus to #397, but it looks like we never resolved this one!

> We experienced cases where the OPE worker died because of:
>
>   • Incompatible backend: we tried to load a TorchScript 1.7 model on a TorchScript 1.1 backend. This was related to a rollback to an old version that still had the old backend.
>   • OOM killer: a containerized app with a memory quota. If the worker process hits the memory limit (huge model under high load), the OOM killer kills the worker process, and the service then crashes because of this issue.
>
> In both cases, the service could make a smart decision about whether to stay running. It makes sense to allow unloading a model that was loaded successfully once. Neuropod core could close its IPC objects, and if the worker isn't actually dead, it would wake up, time out, release its resources, and exit too. We may even consider allowing the core to send a KILL signal to the worker.
>
> What do you think? Let us know if you need help with the fix.

Makes sense, I'll take another look at this. As I said above, we need to be careful about deadlocks. I'll post another update here once I spend some more time on this today.