qcri/Arabesque

Possible deadlock with ODAGs and natural termination of execution.

Closed this issue · 0 comments

Currently, when a partition does not process any embeddings, it sets itself as halted.

Halted partitions resume on message receive. This works as expected when using embedding list communication because messages are targeted to specific partitions and thus wake them up. With ODAGs, however, on an expansion phase, messages are sent at the worker level and thus fail to wake up the halted partitions.

But the ODAG reading code assumes that all partitions are running and initialized a barrier with the total number of local partitions.

For the deadlock to activate, some partitions have to halt while others continue computing. This usually doesn't happen if we prune execution at a certain depth (all workers will terminate simultaneously) and if the number of embeddings we are handling is consistently bigger than the number of partitions (makes it very improbable that some partitions be empty while others not).