Legion: Non-deterministic segfault in Pennant


Pennant, when run on 2 nodes on Perlmutter, will non-deterministically segfault with this backtrace. I'm on commit fc6364. I cannot recreate this issue on Sapling. The command line I'm using is:

GASNET_BACKTRACE=1 srun --output=bt.log --gpus-per-node 4 --unbuffered -n 2 -N 2 --ntasks-per-node 1 --cpu_bind none ../regent.py pennant.rg -fpredicate 0 -fflow 0 -fopenmp 0 pennant.tests/leblanc_long8x1000/leblanc.pnt -npieces 8 -numpcx 1 -numpcy 8 -seq_init 0 -par_init 1 -hl:sched 1024 1 -ll:gpu 4 -ll:io 1 -ll:util 2 -ll:bgwork 4 -ll:fsize 39000 -ll:csize 36000 -ll:zsize 39000 -ll:rsize 0 -ll:gsize 0 -lg:eager_alloc_percentage 10 -dm:memoize 1 -level runtime=5

What are the values of this->owner_space and this->local_space in frame 11 of thread 12?

For what it's worth, this error is highly non-deterministic: I had to rerun the program half a dozen times before I hit it again, and the backtraces are slightly different each time.

I went to the corresponding frame and printed what you asked:

#10 0x00007fddbb986a75 in Legion::Internal::FutureImpl::unpack_future_result (this=0x7fa428073ab0, derez=...)
    at /global/u2/r/rsoi/legion/runtime/legion/runtime.cc:2245
2245	           ApEvent ready = pending->second.instance->copy_from(instance,
(gdb) p this->owner_space
$1 = 1
(gdb) p this->local_space
$2 = 0

Here is another backtrace. This one is from a 4-node run of a modified version of Pennant in which, as usual, there is a wrapper task that is control-replicated in addition to the top-level task. I don't know whether this is caused by the same underlying bug or not.

And, for what it's worth, when I process the logs for the runs that succeed (since these errors are random), I always seem to get the following warning:

WARNING: A significant number of long latency messages were detected during this run meaning that the network was likely congested and could be causing a significant performance degredation. We detected 96998 messages that took longer than 1000.00us to run, representing 27.96% of 346948 total messages. The longest latency message required 196069.68us to execute. Please report this case to the Legion developers along with an accompanying Legion Prof profile so we can better understand why the network is so congested.

Try this branch:
https://gitlab.com/StanfordLegion/legion/-/merge_requests/1160

> And, for what it's worth, when I process the logs for the runs that succeed (since these errors are random), I always seem to get the following warning:

To be clear: this warning has nothing to do with the future issues. The most likely cause is that you are doing lots of unnecessary remote task mappings. Try putting this assertion at the top of every task body:

assert(task->orig_proc.address_space() == task->current_proc.address_space());

Then fix your mapper so that you never hit that assertion and see if the performance warning goes away.
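For reference, here is a minimal sketch of where that assertion would sit in a Legion C++ task body. This is not code from the thread: the task name example_task and its empty body are hypothetical, but the signature is the standard Legion C++ task entry point.

#include <cassert>
#include <vector>
#include "legion.h"

using namespace Legion;

// Hypothetical task body showing where the suggested check goes.
void example_task(const Task *task,
                  const std::vector<PhysicalRegion> &regions,
                  Context ctx, Runtime *runtime)
{
  // Fails whenever this task was mapped onto a different address space
  // (node) than the one that launched it, i.e. a remote task mapping.
  assert(task->orig_proc.address_space() ==
         task->current_proc.address_space());

  // ... rest of the task body ...
}

If the assertion fires, the mapper is sending tasks to a node other than the one that launched them; keeping the mapper's chosen processors on the originating address space should make the assertion, and ideally the latency warning, go away.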

In case this is still useful:

(gdb) f 10
#10 0x00007f04dc98b3cd in Legion::Internal::FutureImpl::unpack_future_result (
    this=0x7ecb540cb670, derez=...) at /global/u2/r/rsoi/legion/runtime/legion/runtime.cc:2274
2274	             ApEvent ready = pending->second.instance->copy_from(instance,
(gdb) p this->owner_space
$1 = 1
(gdb) p this->local_space
$2 = 0

I need a reproducer for this one. I have no idea how you are managing this. You have some kind of really crazy mapping happening if you're triggering that assertion.

This is just vanilla Pennant on Perlmutter, so if you can get an account on it you should be able to reproduce this yourself. I tried again but was unable to reproduce this on Sapling.

I'm not going to be able to get an account on Perlmutter.

What is the command line that you are using to produce the bug on Perlmutter?

The command line is in the issue description above. You can find the testcase I'm using, leblanc_long8x1000, here.

Using the assertion you provided, I confirmed that the only tasks that violate it are the top-level and wrapper tasks, presumably because they are control-replicated. But profiles from the same run still show the long-latency-message warning, so it must be caused by something other than remote task mappings.

Pull and try again to confirm that the assertion failure is fixed.

Confirmed, thanks.

I just want to confirm whether the fix has been fully merged into master. @lightsighter?

I already merged the change into the master branch.