StanfordLegion/legion

Legion: Modified Circuit weak scaling

rupanshusoi opened this issue · 8 comments

I'm working with a modified version of Circuit in which the main timestep loop has been lifted into a wrapper task. Both the wrapper task and the top-level task are control replicated.

The issue is that this version of Circuit does not seem to be tracing, as evidenced by the absence of tracing-related meta-tasks in the profiles below, and consequently does not weak scale to even 16 nodes.
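
For concreteness, here is a minimal sketch of the structure described above, using the Legion C++ API. The task IDs, trace ID, loop counts, and the circuit_timestep helper are hypothetical stand-ins for the actual Circuit code; the explicit begin_trace/end_trace calls are just one way to make sure the repeated timesteps get traced, independent of mapper memoization.

    #include <vector>
    #include "legion.h"

    using namespace Legion;

    // Hypothetical task IDs and trace ID for this sketch.
    enum TaskIDs { TOP_LEVEL_TASK_ID, WRAPPER_TASK_ID };
    enum TraceIDs { MAIN_LOOP_TRACE_ID = 1 };

    static const int TIMESTEPS_PER_WRAPPER = 100;
    static const int NUM_WRAPPERS = 10;

    // Stand-in for the real per-timestep index launches of Circuit
    // (calculate new currents, distribute charge, update voltages).
    void circuit_timestep(Context ctx, Runtime *runtime) { /* ... */ }

    // Wrapper task: runs ~100 timesteps of the application main loop.
    void wrapper_task(const Task *task,
                      const std::vector<PhysicalRegion> &regions,
                      Context ctx, Runtime *runtime)
    {
      for (int t = 0; t < TIMESTEPS_PER_WRAPPER; t++) {
        // Delimit a trace around the repeated timestep body.
        runtime->begin_trace(ctx, MAIN_LOOP_TRACE_ID);
        circuit_timestep(ctx, runtime);
        runtime->end_trace(ctx, MAIN_LOOP_TRACE_ID);
      }
    }

    // Top-level task: launches a sequence of wrapper tasks.
    void top_level_task(const Task *task,
                        const std::vector<PhysicalRegion> &regions,
                        Context ctx, Runtime *runtime)
    {
      for (int i = 0; i < NUM_WRAPPERS; i++) {
        TaskLauncher launcher(WRAPPER_TASK_ID, TaskArgument());
        runtime->execute_task(ctx, launcher);
      }
    }

    int main(int argc, char **argv)
    {
      Runtime::set_top_level_task_id(TOP_LEVEL_TASK_ID);
      {
        TaskVariantRegistrar registrar(TOP_LEVEL_TASK_ID, "top_level");
        registrar.add_constraint(ProcessorConstraint(Processor::LOC_PROC));
        registrar.set_replicable();  // top level is control replicated
        Runtime::preregister_task_variant<top_level_task>(registrar, "top_level");
      }
      {
        TaskVariantRegistrar registrar(WRAPPER_TASK_ID, "wrapper");
        registrar.add_constraint(ProcessorConstraint(Processor::LOC_PROC));
        registrar.set_replicable();  // wrapper level is also control replicated
        Runtime::preregister_task_variant<wrapper_task>(registrar, "wrapper");
      }
      return Runtime::start(argc, argv);
    }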

Here is a profile on 8 nodes, where the weak scaling performance is acceptable but not great; here is a profile on 16 nodes, where performance is worse; and here is one on 32 nodes, where performance is worse still. The last profile includes only the first 8 nodes.

I'm running with the following Legion options:

... -hl:sched 1024 1 -ll:gpu 1 -ll:io 1 -ll:util 2 -ll:bgwork 4 -ll:csize 15000 -ll:fsize 15000 -ll:zsize 15000 -ll:rsize 0 -ll:gsize 0 -lg:eager_alloc_percentage 10 -dm:same_address_space 1 

If you're using the default mapper, you also need to pass -dm:memoize. It also doesn't seem like the lack of tracing here is the reason it is not weak scaling as you'd expect -- the main difference I see in the profiles is that the amount of channel utilization doubles from 16 to 32 nodes.
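
For example (assuming the default mapper), that would just mean appending the flag to the options listed earlier:

... -ll:csize 15000 -ll:fsize 15000 -ll:zsize 15000 -ll:rsize 0 -ll:gsize 0 -lg:eager_alloc_percentage 10 -dm:same_address_space 1 -dm:memoize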

What does high channel utilization usually indicate? A bad mapping strategy?

Channel utilization corresponds to time spent communicating. At 8 nodes the copies between calls to calculate new currents take ~2 ms; at 32 nodes that communication takes ~16 ms.

Here are new profiles with -dm:memoize 1: 8, 16 and 32 nodes. Performance is better but still not great.

At 32 nodes there are a few gaps in the profile where nothing seems to be happening: 20.9-21.5 s, 23.5-23.8 s, 37.9-38.2 s. These gaps could be the source of the performance hit, so I'm trying to understand why they exist and how they can be mitigated.

Average channel utilization seems to jump up on going from 16 to 32 nodes, but that seems to be a profiler bug or something like that, since individually every channel's average utilization is only ~10%.

For what it's worth, when processing the profiles for the 16 and 32 node runs, I received this warning:

WARNING: A significant number of long latency messages were detected during this run meaning that the network was likely congested and could be causing a significant performance degredation. We detected 3865 messages that took longer than 1000.00us to run, representing 7.77% of 49740 total messages. The longest latency message required 35524.44us to execute. Please report this case to the Legion developers along with an accompanying Legion Prof profile so we can better understand why the network is so congested.

Do you have profiles of the unmodified Circuit weak scaling well?

The long messages might be related to those gaps, but I don't have a good idea of why they are appearing in the first place. What machine is this?

Average channel utilization seems to jump up on going from 16 to 32 nodes, but that seems to be a profiler bug or something like that, since individually every channel's average utilization is only ~10%.

But if multiple nodes are using their channels at the same time, then the overall utilization will increase, right?

The unmodified Circuit is the code we've scaled in all of our experiments, and I believe it works as expected. The runs are from Perlmutter, as far as I understand.

There are two kinds of changes here:

  1. An additional level of task hierarchy (the top-level task calls "wrapper" tasks that each execute some ~100 timesteps of the application main loop; both top-level and wrapper tasks are control replicated)
  2. The wrapper tasks make a backup copy of the region data and copy it in and out (a rough sketch of this pattern follows the list)
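
A rough sketch of what (2) could look like with the Legion C++ API is below. The function name, region handles, and field ID are hypothetical; the actual Circuit modification may structure this differently.

    #include "legion.h"
    using namespace Legion;

    // Hypothetical backup copy issued by the wrapper task. 'data_lr' and
    // 'backup_lr' are assumed to be logical regions with the same index
    // space and field layout; 'fid' stands in for the real field IDs.
    void backup_copy(Context ctx, Runtime *runtime,
                     LogicalRegion data_lr, LogicalRegion backup_lr, FieldID fid)
    {
      CopyLauncher launcher;
      launcher.add_copy_requirements(
          RegionRequirement(data_lr,   READ_ONLY,     EXCLUSIVE, data_lr),
          RegionRequirement(backup_lr, WRITE_DISCARD, EXCLUSIVE, backup_lr));
      launcher.add_src_field(0, fid);
      launcher.add_dst_field(0, fid);
      runtime->issue_copy_operation(ctx, launcher);
      // The copy-in at the start of the wrapper and the copy-out at the end
      // would call this with the source and destination regions swapped.
    }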

I suppose we could try to narrow down the issue by making a version that does (1) and not (2). That could be a nontrivial amount of work though. @rupanshusoi?

It's actually easy. Runs are in queue now.

Have you written a custom mapper for this code yet? The last time I looked at this code for @rupanshusoi, it was still relying on the circuit mapper that came with the code and this part:

An additional level of task hierarchy (the top-level task calls "wrapper" tasks that each execute some ~100 timesteps of the application main loop; both top-level and wrapper tasks are control replicated)

was causing pretty much every single task in the second level of the hierarchy to be mapped remotely from where it was sharded (because the circuit mapper isn't "smart" enough to understand this pattern), thereby causing unnecessary data movement for nearly every task that was being run. Tasks were just being randomly sprayed across the machine in a completely incoherent fashion as the mapper tried to "load balance" the tasks it was seeing across all the processors. It was a good fuzzing test for the runtime, but the mapping was awful. If you haven't made sure that all the tasks are being sharded correctly and then locally mapped first, then I'm not surprised at all that things are not scaling.
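
For illustration, a sketch of the kind of mapper fix being suggested here, assuming a mapper derived from Legion's DefaultMapper. The class name and policy are illustrative, not the actual circuit mapper, and a real fix would also want to double-check the sharding functors and slicing rather than only the task options.

    #include "legion.h"
    #include "mappers/default_mapper.h"

    using namespace Legion;
    using namespace Legion::Mapping;

    // Illustrative mapper tweak: keep each task on the node where it was
    // sharded and map it locally, instead of letting the default heuristics
    // scatter tasks (and their data) across the machine.
    class LocalFirstMapper : public DefaultMapper {
    public:
      LocalFirstMapper(MapperRuntime *rt, Machine machine, Processor local)
        : DefaultMapper(rt, machine, local) { }

      virtual void select_task_options(const MapperContext ctx,
                                       const Task &task,
                                       TaskOptions &output)
      {
        DefaultMapper::select_task_options(ctx, task, output);
        // Start the task on a processor owned by this node and map it here
        // rather than shipping it to a remote node to be mapped.
        output.initial_proc = local_gpus.empty() ? local_cpus.front()
                                                 : local_gpus.front();
        output.map_locally = true;
      }
    };

Registering a mapper along these lines (e.g. via Runtime::add_registration_callback and replace_default_mapper on each processor) would be the usual way to make sure tasks end up mapped where they are sharded.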