StanfordLegion/legion

A question about Control Replication shards

97CBR opened this issue · 4 comments

@elliottslaughter
After reading the Control Replication (SC 2017) [PDF] and Dynamic Control Replication (PPoPP 2021) [PDF] papers, I have a question.

I want to split the following stencil into six shards, numbered 0 through 5. When computing the second or third layer of iterations on shard 1, for example, is there a dependency only on shards 0 and 2, and not on shards 3, 4, and 5?
If so, does shard 1 not need to synchronize with shards 3, 4, and 5 when executing its iteration? That is, if shards 3, 4, and 5 are computing more slowly for some reason, can shards 0 and 1 move on to the subsequent layers of iteration sooner?

If not, what are the typical scenarios where Control Replication can improve performance? Is it loops whose tasks have no dependencies on one another?

[image: the stencil referred to above]

Control replication, whether static or dynamic, should never change the dependencies in a program. If it does, then that is a bug. The intention is to run exactly the same program, just more efficiently.

The efficiency comes from a bottleneck that occurs in task-based systems without control replication. Without some form of CR, the main (or top-level) task runs on exactly one node. Therefore, one node is responsible for the dynamic analysis of dependencies, and for issuing work to other tasks. That node becomes a bottleneck when you scale out.

In CR, our solution is to execute the main task on multiple nodes. Each node becomes responsible for a subset of the computation. However, the dependency graph is the same. This is the magic of CR: the nodes must somehow collaboratively construct the same dependence graph, despite doing so in a distributed manner.
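To make "same dependence graph, built in a distributed manner" concrete, here is a toy sketch in plain Python. It is not Legion code and does not use any Legion API. It assumes a 1D three-point stencil (each piece reads itself and its immediate neighbours from the previous iteration), which may not match the stencil in the image, and it assumes each of the six shards owns exactly one piece. The names `task_deps`, `single_controller_graph`, `replicated_graph`, and `transitive_deps` are made up for illustration.

```python
NUM_SHARDS = 6  # shards 0..5, as in the question
NUM_ITERS = 3   # iterations 0..2

def task_deps(shard, it):
    """Direct dependencies of task (shard, it) under the ASSUMED 3-point stencil."""
    if it == 0:
        return set()
    neighbours = (shard - 1, shard, shard + 1)
    return {(s, it - 1) for s in neighbours if 0 <= s < NUM_SHARDS}

def single_controller_graph():
    """One node walks the whole loop nest and analyses every task."""
    return {(s, it): task_deps(s, it)
            for it in range(NUM_ITERS) for s in range(NUM_SHARDS)}

def replicated_graph():
    """Every shard runs the same loop nest but only analyses the tasks it owns."""
    graph = {}
    for me in range(NUM_SHARDS):
        for it in range(NUM_ITERS):
            for s in range(NUM_SHARDS):
                if s == me:  # hypothetical ownership rule: shard i owns piece i
                    graph[(s, it)] = task_deps(s, it)
    return graph

def transitive_deps(shard, it):
    """Everything task (shard, it) depends on, directly or indirectly."""
    seen, stack = set(), [(shard, it)]
    while stack:
        for dep in task_deps(*stack.pop()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# The union of the per-shard pieces is exactly the single-controller graph.
assert single_controller_graph() == replicated_graph()

print(sorted(task_deps(1, 2)))                       # [(0, 1), (1, 1), (2, 1)]
print(sorted({s for s, _ in transitive_deps(1, 2)})) # [0, 1, 2, 3]
```

In this toy model, the direct dependencies of shard 1 at iteration 2 are only shards 0, 1, and 2 at iteration 1, and that is true whether one node or six nodes build the graph. Note that the transitive dependencies reach one shard further for each iteration you go back, so shard 3 at iteration 0 is still an ancestor of shard 1 at iteration 2.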

If not, what are the typical scenarios where Control Replication can improve performance?

I'll just add one thing to the above answer with regard to performance. Control replication is a program transformation, not a program optimization. The transformation improves the scalability of the program, not the peak performance that can be achieved on a given machine configuration. In many cases you might beat the performance of a task-based system that has a single controller node and suffers from a sequential bottleneck when spawning tasks. But compared to the speed-of-light performance of the same application written by hand in MPI, there's nothing inherent in control replication that will improve on that; it just makes your life considerably easier than writing the MPI version yourself.
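A rough way to picture "scalability, not peak performance" is the back-of-the-envelope model below. It is only a sketch under stated assumptions, not a measurement of Legion: weak scaling with one task per node per time step, control work that fully overlaps with compute, and made-up costs of 10 ms of compute per task and 0.1 ms to analyse and issue one task. It also ignores whatever coordination cost the replicated controllers pay among themselves. The function `step_time` and its parameters are hypothetical.

```python
def step_time(num_nodes, control_nodes, compute_s=0.010, issue_s=0.0001):
    """Seconds per time step in a toy model (all costs are ASSUMED values).

    Each node runs one task per step (weak scaling). The tasks to analyse and
    issue are split evenly across `control_nodes`, and the step takes as long
    as the slower of compute and control.
    """
    tasks_per_controller = -(-num_nodes // control_nodes)  # ceiling division
    control_s = tasks_per_controller * issue_s
    return max(compute_s, control_s)

for n in (16, 128, 1024):
    single = step_time(n, control_nodes=1)      # no control replication
    replicated = step_time(n, control_nodes=n)  # one control shard per node
    print(f"{n:5d} nodes: single controller {single * 1e3:6.1f} ms/step, "
          f"replicated {replicated * 1e3:6.1f} ms/step")
```

With these assumed numbers, the single controller keeps up at 16 nodes but becomes the limit somewhere past 100 nodes, while the replicated version stays at the compute-bound 10 ms per step. In neither case does the step get faster than the 10 ms of compute, which is the point made above: replication removes the bottleneck but does not raise the ceiling.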

@97CBR Do you have any further questions?

I have no more questions. Thank you for your patient replies. Thank you very much.