Modified Pennant does not weak scale on Perlmutter


I'm running (as usual) a modified version of Regent Pennant on Perlmutter. This code does not weak scale to even 64 nodes, and Elliott and I have been unable to diagnose the root cause.

Here are three profiles: 4, 16, and 64 nodes. The last one shows only the first 8 nodes.

This version of Pennant has two wrapper tasks. The first one executes twice, the second executes once. Consequently there are three plateaus of GPU utilization. They are delimited by peaks of CPU utilization corresponding to region copies. All this is expected.

The scaling problem seems to stem from the gap between the end of the first wrapper task and the start of the second. This gap increases with node count, going from 0.6 s on 4 nodes, to 1.5 s on 16, and to 8 s on 64.

Furthermore, there is a stack of equivalence set tasks on the utility processors in this gap in all three profiles. The size of this stack grows with node count. It is thousands of tasks tall on 64 nodes.

Most of these equivalence set tasks appear to be waiting for most of their lifetimes. Their lifetimes increase with node count: on 64 nodes the average lifetime is about 200 ms, on 4 and 16 nodes it is around 2 ms. There are no remote task mappings in this code, except the two control replicated tasks.

No channel appears to be overloaded in any profile. We have had scaling problems in the past caused by channel overutilization due to a bad mapping, but this does not appear to be the culprit here.

I would appreciate any pointers for debugging this issue. Thanks.

Try not using virtual mappings with control-replicated tasks. Always physically map your region requirements for control-replicated tasks. The implementation of virtually mapped regions for control replication is completely unoptimized. Optimizing it would take at least an entire month of engineering work, and nobody has demonstrated a compelling use case for it that couldn't be fixed by not using virtual mappings.

Because the data does not fit on a single node, the only way to do this is to pass $\mathcal{O}(N)$ regions into the control-replicated wrapper task. Do you actually expect that to be more efficient than virtual mapping?

Presumably, if you've already control-replicated the top-level task, then control replication of any sub-tasks further down the task tree shouldn't need to span the whole machine and therefore shouldn't be a scalability issue. If your case is different than that, you'll need to make a compelling case in the Legion meeting for why that is and why it should be prioritized above other things. Making virtual mappings with control-replicated tasks scale to the entire machine is going to be a monstrously huge amount of work.

What we're doing is outlining parts of the top-level task. So if your original task was:

for t = 0, T do
  __demand(__index_launch)
  for i = 0, N do
    ...
  end
  __demand(__index_launch)
  for i = 0, N do
    ...
  end
end

Then we're outlining a task that looks like:

task wrapper(..., T_start : int, T_stop : int)
  for t = T_start, T_stop do
    __demand(__index_launch)
    for i = 0, N do
      ...
    end
    __demand(__index_launch)
    for i = 0, N do
      ...
    end
  end
end

And the top-level task becomes:

for t = 0, T, step_size do
  wrapper(..., t, t + step_size)
end

So the entire point of this is basically to slice the top-level task into parts we can run separately. Therefore, it does need access to the entire tree.

Would the above approach work better if we inline-launched the wrapper tasks?

Rupanshu reminded me that if we inline launch the wrapper task, this will introduce blocking into the top-level task. For the experiments Rupanshu is currently working toward, this may be an acceptable workaround. However, part of the point of our approach was to demonstrate that we can isolate self-contained resilient tasks within a larger task graph, and blocking (and inlining in general) torpedoes one of our major claims: that we can make applications resilient in a piecemeal fashion, with minimal interference to the rest of the task graph.

I totally get that this may be a large and unreasonable amount of work given other priorities, but now you hopefully at least understand why we wanted it.

Fwiw, Stencil is also exhibiting the same behavior now at 128 nodes. Here are some profiles: 2, 64, 128 nodes. Note the increasing gap between the 1st and 2nd wrapper tasks, just like Pennant.

Would the above approach work better if we inline-launched the wrapper tasks?

Yes

Rupanshu reminded me that if we inline launch the wrapper task, this will introduce blocking into the top-level task.

Why does it introduce blocking?

I totally get that this may be a large and unreasonable amount of work given other priorities, but now you hopefully at least understand why we wanted it.

Modulo the above questions I think I understand. Virtual mapping (especially with read-write privileges) is not free. You get to avoid making the instances, but at the cost of considerably more meta-data movement. We could reduce the meta-data movement, but then you would lose the ability to make new refinements. There are further optimizations to be done here but they are complex and tedious and will take lots of work to make them first functional and later performant.

Why does it introduce blocking?

We're trying to demonstrate an application-level resilience technique. The basic idea is that each resilient task is responsible for its own resilience. That means making a backup of any data on entry to the task, running the computation, and then restoring the backup and rerunning if there is a failure.

Now, if the task is a leaf task, this is all fine. You can easily make this self-contained and nothing else in the program needs to know anything about it (or see any performance impact, unless the task actually fails).

However, the vast majority of our applications have very small leaf tasks. They cannot amortize the overhead of doing this because there just isn't enough work. So in order to amortize that overhead, we need to lift this operation up to the level of the parent task that calls these tasks.

We don't want to do it to the top-level task, because restarting amounts to rerunning the entire program from the beginning. There would be no benefit.

So we have to slice the top-level task into pieces. That's where the wrapper tasks come in. We enter, copy data, run some subset of the timestep loop, check if there are any errors, and restore (and rerun) if necessary. It's the checking for errors (along with the control-flow-dependent restore and rerun tasks) that requires this code to block.
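
For concreteness, here is a minimal Regent sketch of that wrapper structure. It is not the actual application code: cell, step, and detect_error are hypothetical stand-ins, and the real backup/restore machinery is reduced to simple field copies. The point is that evaluating the boolean future from the error check is what forces the wrapper to block:

import "regent"

fspace cell { value : double }

task step(r : region(ispace(int1d), cell))
where reads writes(r) do
  for c in r do r[c].value += 1.0 end  -- stand-in for one timestep of real work
end

task detect_error(r : region(ispace(int1d), cell)) : bool
where reads(r) do
  return false  -- placeholder for the real failure check
end

task wrapper(r : region(ispace(int1d), cell),
             b : region(ispace(int1d), cell),
             cs : ispace(int1d),
             p : partition(disjoint, r, cs),
             T_start : int, T_stop : int)
where reads writes(r, b) do
  copy(r.value, b.value)  -- back up the data on entry
  for t = T_start, T_stop do
    __demand(__index_launch)
    for i in cs do
      step(p[i])
    end
  end
  -- Evaluating the future returned by detect_error in the condition forces
  -- the wrapper to block until the check completes.
  if detect_error(r) then
    copy(b.value, r.value)  -- restore from the backup
    -- ...and rerun the slice...
  end
end

task main()
  var cs = ispace(int1d, 4)
  var pts = ispace(int1d, 1024)
  var r = region(pts, cell)
  var b = region(pts, cell)
  var p = partition(equal, r, cs)
  fill(r.value, 0.0)
  fill(b.value, 0.0)
  var step_size = 10
  for t = 0, 100, step_size do
    wrapper(r, b, cs, p, t, t + step_size)
  end
end
regentlib.start(main)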

If the wrapper is its own control replicated task, this is still mostly isolated from the application. You can run other concurrent work while the resilient task is executing and all of that mostly doesn't even need to know anything is going on. However, if you inline the wrapper task, now we introduce blocking into the top-level task which potentially interferes with the ability to discover and expose other parallel work.

I rewrote my version of Pennant to not use any virtual instances, but it did not fix the problem. Profiles: 16, 32, and 64 nodes. This is a slightly different configuration, so these profiles are not directly comparable to the previous ones. But the non-scaling is pretty clear. E.g. the 64 node one has large gaps where very little seems to be happening on the machine.

E.g. the 64 node one has large gaps where very little seems to be happening on the machine.

The time it is taking to map your wrapper tasks is growing with the size of the machine. In particular, I can see in the profiles that your wrapper tasks have hundreds (if not thousands) of region requirements. Most importantly, the number of region requirements being requested by the runtime is scaling with the size of the machine. You need to make sure the number of region requirements used by the wrapper tasks is invariant to the size of the machine. As another performance optimization, I would also set this to false in your mapper if you can tolerate it:
https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion/legion_mapping.h?ref_type=heads#L321

There is also an alternative implementation of virtual mappings here:
https://gitlab.com/StanfordLegion/legion/-/merge_requests/1195
but that will not help you at all if the number of region requirements passed to the wrapper task scales with the size of the machine.

You need to make sure the number of region requirements used by the wrapper tasks is invariant to the size of the machine.

I think this is unfortunately not doable in my use case. I am weak scaling, so the number of sub-regions scales with the size of the machine, and as Elliott mentioned previously, I have to pass all those sub-regions into my wrapper tasks.

I have to pass all those sub-regions into my wrapper tasks.

I don't understand that at all. You should only have to pass the root regions in to get the privileges. You don't need to pass in the subregions. You can pick out the subregions from the parent region privileges when you launch the sub-tasks in the wrapper task.
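
As a minimal sketch of this suggestion (again with hypothetical names, and assuming the data is organized as a single root region with a disjoint partition), the wrapper below names only the root region and the partition in its signature, so its region-requirement count stays constant as the machine grows; the index launches inside pick the subregions out of the parent's privileges. Whether the root region is physically or virtually mapped for the wrapper remains a mapper decision:

import "regent"

fspace cell { value : double }

task step(r : region(ispace(int1d), cell))
where reads writes(r) do
  for c in r do r[c].value += 1.0 end  -- stand-in for one timestep of real work
end

task wrapper(r : region(ispace(int1d), cell),
             cs : ispace(int1d),
             p : partition(disjoint, r, cs),
             T_start : int, T_stop : int)
where reads writes(r) do
  for t = T_start, T_stop do
    __demand(__index_launch)
    for i in cs do
      -- Privileges for p[i] are derived from the parent's privileges on r,
      -- so no per-subregion requirement is needed at the wrapper boundary.
      step(p[i])
    end
  end
end

task main()
  var cs = ispace(int1d, 4)
  var r = region(ispace(int1d, 1024), cell)
  var p = partition(equal, r, cs)
  fill(r.value, 0.0)
  wrapper(r, cs, p, 0, 10)
end
regentlib.start(main)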

My understanding is that that is not possible since the root regions will not fit on a single node (and we are using physical instances). Is that incorrect?

Try using the root regions with the new virtual mapping implementation.

Segfault on 16 nodes. This did not immediately reproduce in debug mode. I can try again tomorrow.

Since the error is not reproducing in debug mode, I collected some profiles: 16 nodes, 32 nodes. GPU utilization is low, as is expected in debug mode, but otherwise these profiles seem similar to the ones I posted in the beginning. In particular, the gap between the two wrapper tasks seems to increase with node count, which suggests this would not scale.

Try using the root regions with the new virtual mapping implementation.

Mike, if you'd like me to fuzz this implementation, let me know what privileges I should be testing with.

Segfault on 16 nodes. This did not immediately reproduce in debug mode. I can try again tomorrow.

I doubt this has anything to do with virtual mappings. This is a memory corruption of an STL data structure in the logical analysis. Run with valgrind and see if you can find where the memory corruption is coming from.

collected some profiles: 16 nodes, 32 nodes. GPU utilization is low, as is expected in debug mode, but otherwise these profiles seem similar to the ones I posted in the beginning.

Profiles in debug mode are never interesting. The timings are all warped because Legion is so STL heavy. I will say in these profiles there is some very strange blocking behavior in your wrapper task. It is blocking all the time and preventing Legion from getting ahead. Run with -lg:warn and fix all the warnings that it reports.

Mike, if you'd like me to fuzz this implementation, let me know what privileges I should be testing with.

You can fuzz if you want, but I already ran the fuzzer on this branch and didn't see anything. Memory corruption from the application is not something that will show up in the fuzzer.

You can fuzz if you want, but I already ran the fuzzer on this branch and didn't see anything.

Note that the newest fuzzer only tests read-only and reduction privileges with virtual instances, because your comment at https://gitlab.com/StanfordLegion/legion/-/merge_requests/1191 led me to believe those are the only ones that would work.

If you would like me to prepare (and/or test) a version of the fuzzer that supports read-write privileges on virtual instances, I can do that.

Memory corruption from the application is not something that will show up in the fuzzer.

Exactly. It will help us isolate potential runtime issues, making any remaining corruption even more likely to be in the application.

Note that the newest fuzzer only tests read-only and reduction privileges with virtual instances, because your comment at https://gitlab.com/StanfordLegion/legion/-/merge_requests/1191 led me to believe those are the only ones that would work.

All virtual mappings should always work. Whether they are performant or not is another story.

If you would like me to prepare (and/or test) a version of the fuzzer that supports read-write privileges on virtual instances, I can do that.

Go ahead and do that.

Fuzzer version StanfordLegion/fuzzer@d69fc33 restores support for the full spectrum of privileges in inner tasks (and therefore, virtual mappings).

With that fuzzer running on https://gitlab.com/StanfordLegion/legion/-/merge_requests/1195, the only errors I see are ones that you fixed via https://gitlab.com/StanfordLegion/legion/-/merge_requests/1196. I have not fuzzed this super extensively, but everything I see looks good.

Note that all of this is debug mode Legion, and therefore would not reproduce Rupanshu's issue. I'll try release mode next.

I fixed a bug in the restart implementation, and I can no longer reproduce this error in release or debug mode up to 32 nodes.

I hit a similar segfault in release mode on 64 nodes. Backtrace.

Edit: I hit what looks like basically the same segfault on 128 nodes as well.

Another segfault in an STL data structure. If you're not running with valgrind, now is the time to start.

FYI, Valgrind is indeed flagging some tasks, but root-causing the issue will take a significant amount of time and so will not get done before the SC deadline.

I just want to add to this: Rupanshu ran with (Regent) bounds checks as well. Bounds checks did not report anything. So we have a disagreement between Valgrind and bounds checks on this one.

There are lots of other kinds of writes that can be done inside tasks that aren't region accesses. All writes, not just region writes, can lead to heap corruption.