StanfordLegion/legion

Legion: S3D poor scaling on Frontier


We're seeing some weird issues with S3D on Frontier where timesteps 11 - 20 and 21 - 30 take significantly longer than the rest. At 2048 nodes, for example, timesteps 11 - 20 took 28 minutes while 31 - 40 took ~1 minute.

There is a profile here at 1024 nodes: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_subrank_master/pwave_x_1024_hept/legion_prof/

You can see timesteps 1 - 10 from 240 - 340 seconds. Timesteps 31 - 40 run from 1542 - 1562 seconds. The ones in between are harder to pick out, but they should be marked by AwaitMPITask and AwaitMPITaskEarly on the CPU.

Profiles at other scales are available here: http://sapling2.stanford.edu/~seshu/s3d_subrank_master/

I think we need to see all the nodes. Most likely something bad is happening on just one or a few nodes.

If anything, it looks like the trace capture operation takes a very long time to finish on some of the smaller node counts. Have you tried this with the non-idempotent-traces branch?
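For context, the "trace capture operation" refers to the first pass through a Legion trace: the runtime records the operations issued between `begin_trace` and `end_trace` once and replays them on later iterations. The sketch below is a minimal C++ illustration of that pattern only, not S3D's actual code (in Regent the same thing is usually expressed with a `__demand(__trace)` annotation); `STEP_TASK_ID`, `NUM_STEPS`, and the empty step task are hypothetical stand-ins.

```cpp
// Minimal sketch of Legion tracing in the C++ runtime (not S3D's code).
#include <vector>
#include "legion.h"

using namespace Legion;

enum TaskIDs { TOP_LEVEL_TASK_ID = 0, STEP_TASK_ID = 1 };
constexpr int NUM_STEPS = 40;

void step_task(const Task *task, const std::vector<PhysicalRegion> &regions,
               Context ctx, Runtime *runtime) {
  // Stand-in for one timestep's worth of work.
}

void top_level_task(const Task *task,
                    const std::vector<PhysicalRegion> &regions,
                    Context ctx, Runtime *runtime) {
  const TraceID tid = 1;
  for (int step = 0; step < NUM_STEPS; step++) {
    // Operations issued between begin_trace and end_trace are recorded on
    // the first iteration (the trace capture) and replayed afterwards.
    runtime->begin_trace(ctx, tid);
    TaskLauncher launcher(STEP_TASK_ID, TaskArgument(&step, sizeof(step)));
    runtime->execute_task(ctx, launcher);
    runtime->end_trace(ctx, tid);
  }
}

int main(int argc, char **argv) {
  Runtime::set_top_level_task_id(TOP_LEVEL_TASK_ID);
  {
    TaskVariantRegistrar registrar(TOP_LEVEL_TASK_ID, "top_level");
    registrar.add_constraint(ProcessorConstraint(Processor::LOC_PROC));
    Runtime::preregister_task_variant<top_level_task>(registrar, "top_level");
  }
  {
    TaskVariantRegistrar registrar(STEP_TASK_ID, "step");
    registrar.add_constraint(ProcessorConstraint(Processor::LOC_PROC));
    Runtime::preregister_task_variant<step_task>(registrar, "step");
  }
  return Runtime::start(argc, argv);
}
```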

If you can capture backtraces while the trace capture operation is running, that would be interesting to see what it is doing. It shouldn't be too hard, since even on four nodes it's taking upwards of four minutes.
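One low-tech way to grab those backtraces, assuming you can attach to or rebuild the process, is to attach gdb (`gdb -p <pid> -batch -ex "thread apply all bt"`) or to install a signal handler that dumps the current stack on demand, as in the generic glibc/execinfo sketch below (this is not a Legion facility):

```cpp
// Generic sketch: install a SIGUSR1 handler that dumps the signaled thread's
// stack, then run "kill -USR1 <pid>" while the trace capture is in flight.
#include <csignal>
#include <execinfo.h>
#include <unistd.h>

static void dump_backtrace(int /*sig*/) {
  void *frames[64];
  int depth = backtrace(frames, 64);
  // backtrace_symbols_fd is async-signal-safe, unlike backtrace_symbols.
  backtrace_symbols_fd(frames, depth, STDERR_FILENO);
}

int main() {
  signal(SIGUSR1, dump_backtrace);
  // The application / Legion runtime would run here; pause() just keeps this
  // standalone example alive so it can receive the signal.
  pause();
  return 0;
}
```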

I did try non-idempotent-traces as well and was seeing weird behavior. I will do another run of that with this version of S3D and generate some profiles.

I'll also see if I can generate a profile of all nodes from one of the smaller runs, since this behavior shows up at the smaller node counts too, and capture stack traces.

There are profiles from non-idempotent-traces here: http://sapling2.stanford.edu/~seshu/s3d_subrank_nonidempotent_traces/

I merged a change to make the Regent compiler cache projection functors in https://gitlab.com/StanfordLegion/legion/-/merge_requests/1167
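For anyone following along, the gist of that change is memoization: projection functors the compiler generates are presumably looked up and reused rather than re-created and re-registered each time. The sketch below is only a rough, hypothetical illustration of that idea; `register_functor` and the string key are stand-ins, not the Regent compiler's actual code.

```cpp
// Rough, hypothetical illustration of caching projection functors by key so
// that an identical functor is registered once and its ID reused thereafter.
#include <cstdint>
#include <map>
#include <string>

using ProjectionID = std::uint32_t;

// Stand-in for the expensive one-time registration with the runtime.
static ProjectionID register_functor(const std::string &key) {
  static ProjectionID next_id = 1;
  (void)key;
  return next_id++;
}

ProjectionID cached_projection_functor(const std::string &key) {
  static std::map<std::string, ProjectionID> cache;
  auto it = cache.find(key);
  if (it != cache.end())
    return it->second;                        // cache hit: reuse existing ID
  ProjectionID id = register_functor(key);    // cache miss: register once
  cache.emplace(key, id);
  return id;
}
```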

I was able to run at 2048 nodes with timesteps 10 - 20 taking 3 minutes and 20 - 30 taking 5 minutes, numbers that are basically flat with node count.

I'll do some more runs to confirm this fully resolves the issue at scale.

Confirmed resolved in fc1607c.