STEllAR-GROUP/hpx

handle_received_parcels may never return

JiakunYan opened this issue · 8 comments

When the parcelport finishes receiving a parcel, it should call the handle_received_parcels function to deliver the parcel to the upper layer. However, I found that this function sometimes never returns.

In most cases, this will not have severe consequences, as most of those invocations happen in the background_work function, which is called repeatedly by the background threads anyway. (They may still cause some buffers not to be freed, though.) However, in LCI, because of the aggressive strategy of overlapping sends and receives, this can cause some sends to never complete and the application to hang.

By adding logs before and after the invocation of handle_received_parcels, I found that only 1669 of the first 1698 invocations actually returned.
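The before/after logging can be reduced to two counters. The following is a minimal sketch, assuming a hypothetical wrapper named counted_call and hypothetical counter names; none of these are part of HPX, and the callable stands in for the real handle_received_parcels call.

```cpp
#include <atomic>

// Hypothetical counters for illustration; not part of HPX.
std::atomic<long> calls_entered{0};
std::atomic<long> calls_returned{0};

// Invokes f and counts entry and return separately, so an invocation
// that never returns shows up as a gap between the two counters.
template <typename F>
void counted_call(F&& f)
{
    calls_entered.fetch_add(1, std::memory_order_relaxed);
    f();    // stand-in for handle_received_parcels(...)
    calls_returned.fetch_add(1, std::memory_order_relaxed);
}

// Number of invocations that have started but have not returned (yet).
long calls_in_flight()
{
    return calls_entered.load() - calls_returned.load();
}
```

A value of calls_in_flight() that stays above zero while the application is idle corresponds to the gap between 1698 entries and 1669 returns reported above.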

@hkaiser Is this the expected behavior of handle_received_parcels?

Revise: In the LCI parcelport, when the worker thread starts sending parcels, if the send temporarily fails, the worker thread can call the background_work function, which can call the handle_received_parcels function and never return (and then the sending is pending forever and the application hangs).

However, I just found that if I don't let the worker thread call the background_work function (so only the background thread calls it), all calls to handle_received_parcels return.

Do you have an idea why a worker thread calling the background_work function can be a problem?

Update: I see the issue. In general, it is not safe. Imagine a worker thread is in the middle of sending a message for Task A. It calls handle_received_parcels and starts to execute Task B, which is put to sleep waiting for condition C to happen. Condition C can depend on A, and then we have a deadlock.

@JiakunYan is there anything we can do to avoid this situation?

For now, I just disabled the option that calls background_work when sending. It would be useful if there were an option to tell handle_received_parcels to always spawn independent user-level threads, instead of running the work directly in the current thread.
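To make the requested option concrete, here is a minimal sketch of the difference between running a received action inline and spawning it as independent work. The Delivery enum, the Scheduler struct, and the deliver function are hypothetical names for illustration only; they do not correspond to any existing HPX interface.

```cpp
#include <deque>
#include <functional>
#include <utility>

// Hypothetical delivery policy; not an HPX option.
enum class Delivery { run_inline, spawn };

// Toy scheduler: a queue of ready tasks that is drained later.
struct Scheduler
{
    std::deque<std::function<void()>> ready;

    void run_all()
    {
        while (!ready.empty())
        {
            auto task = std::move(ready.front());
            ready.pop_front();
            task();
        }
    }
};

// With run_inline the action executes on the caller's stack, so the
// caller cannot continue until the action returns. With spawn the
// action is only queued and the caller continues immediately.
void deliver(Scheduler& s, std::function<void()> action, Delivery mode)
{
    if (mode == Delivery::run_inline)
        action();
    else
        s.ready.push_back(std::move(action));
}
```

With Delivery::spawn, a received action that blocks would suspend only its own task, not the send that happened to be polling for progress.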

I'd like to better understand this problem. Could you please elaborate in more detail under what circumstances this is happening?

Sure, how about this?

LCI sends can fail temporarily because resources are unavailable. I have an option in the LCI parcelport that, once enabled, calls the background_work function when a send fails.

pseudo-code for send_parcel:

while (LCI_send(header) == RETRY) {
  do_background_work();
}

However, I found this can lead to deadlock in some cases. My assumption is:

  1. Worker thread 1 is sending parcel A. This parcel A will trigger another task A.
  2. Parcel A's send temporarily fails, so worker thread 1 calls the background_work function.
  3. Worker thread 1's background_work invocation receives parcel B and invokes the handle_received_parcels function.
  4. Parcel B triggers an HPX direct action (task B), which is executed directly on worker thread 1 (if my understanding of direct actions is correct).
  5. Task B waits on some condition variable C, which will cause worker thread 1 to be put to sleep.
  6. Condition C depends on the execution of task A, but worker thread 1 has been put to sleep, which means parcel A will never be sent out.
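The six steps above can be mimicked in a single-threaded sketch. Everything here is hypothetical (the Simulation struct and all member names are invented for illustration), and two simplifications are assumed: condition C becomes true as soon as parcel A is sent, and the blocking wait in task B is replaced by a bounded poll so that the example terminates instead of hanging.

```cpp
#include <string>
#include <vector>

enum class SendStatus { ok, retry };

// Hypothetical model of one worker thread; not HPX or LCI code.
struct Simulation
{
    int failures_left = 1;       // the send fails this many times first
    bool parcel_a_sent = false;
    bool condition_c = false;    // assumed to follow from parcel A being sent
    bool task_b_stuck = false;
    std::vector<std::string> trace;

    SendStatus try_send()
    {
        if (failures_left > 0) { --failures_left; return SendStatus::retry; }
        parcel_a_sent = true;
        condition_c = true;      // simplification: task A runs and satisfies C
        return SendStatus::ok;
    }

    // Direct action B: runs on the caller's stack and waits for C.
    void task_b(int max_polls)
    {
        trace.push_back("task B waits for C");
        for (int i = 0; i < max_polls; ++i)
            if (condition_c) { trace.push_back("task B done"); return; }
        // A real blocking wait would never get past this point.
        task_b_stuck = true;
        trace.push_back("task B stuck: the send of A is suspended below it");
    }

    void background_work(int max_polls)
    {
        trace.push_back("background_work receives parcel B");
        task_b(max_polls);
    }

    // The retry loop from the pseudo-code: progress is made inline.
    void send_parcel_a(int max_polls)
    {
        while (try_send() == SendStatus::retry)
            background_work(max_polls);
    }
};
```

In this model task_b_stuck ends up true whenever the first send attempt fails: C can only become true after try_send succeeds, but the loop that calls try_send cannot continue until task B returns.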

@JiakunYan instead of calling the do_background_work() yourself, simply don't do anything in this case (return control to the scheduler) and let HPX invoke your parcelport again during its next invocation of the background work. I don't think there is a need for you to invoke it directly as HPX will call it in regular intervals anyways.
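A minimal sketch of this suggestion, assuming a hypothetical Parcelport struct with a pending queue: a failed send is queued and control returns to the caller, and the queue is retried whenever the scheduler next invokes background_work. The names and the queueing strategy are assumptions for illustration, not the actual LCI parcelport code.

```cpp
#include <deque>
#include <string>
#include <utility>

enum class SendStatus { ok, retry };

// Hypothetical parcelport model; not the HPX or LCI implementation.
struct Parcelport
{
    int failures_left = 0;              // simulated transient failures
    std::deque<std::string> pending;    // parcels waiting for a retry
    std::deque<std::string> sent;

    SendStatus try_send(const std::string& parcel)
    {
        if (failures_left > 0) { --failures_left; return SendStatus::retry; }
        sent.push_back(parcel);
        return SendStatus::ok;
    }

    // Called by a worker thread. On a transient failure the parcel is
    // queued and control returns to the caller; no progress is made inline.
    void send_parcel(std::string parcel)
    {
        if (try_send(parcel) == SendStatus::retry)
            pending.push_back(std::move(parcel));
    }

    // Called by the scheduler at regular intervals. Returns true if at
    // least one pending parcel was sent.
    bool background_work()
    {
        bool progressed = false;
        while (!pending.empty())
        {
            if (try_send(pending.front()) == SendStatus::retry)
                break;
            pending.pop_front();
            progressed = true;
        }
        return progressed;
    }
};
```

Because send_parcel never runs received work on its own stack, a blocked action cannot sit on top of an unfinished send.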

Sorry, forgot about this issue.

Yes, I ended up just calling the yield() function. The only problem is that, when there are many tasks, the background_work function will not be invoked frequently, but we can control that with max_busy_loop.
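The yield-based retry can be sketched as follows. This is only an illustration: std::this_thread::yield stands in for the HPX yield call, and send_with_yield and its try_send parameter are hypothetical names.

```cpp
#include <thread>

enum class SendStatus { ok, retry };

// Retries try_send until it succeeds, yielding between attempts instead
// of making progress inline. Returns the number of yields performed.
template <typename TrySend>
int send_with_yield(TrySend&& try_send)
{
    int yields = 0;
    while (try_send() == SendStatus::retry)
    {
        std::this_thread::yield();    // stand-in for the HPX yield call
        ++yields;
    }
    return yields;
}
```

The loop itself never touches received parcels; how quickly the send succeeds then depends on how often the scheduler gets around to running background_work, which is the trade-off mentioned above.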