StanfordLegion/legion

Freeze in S3D on attach_name


I am seeing a freeze in S3D on attach_name at 4 nodes. This is running a very slightly modified version of S3D relative to #1651: the only difference is that I added index launches to some of the tasks that create partitions. Note that I have touched nothing that should directly affect an attach_name call, only (perhaps) the order of such calls. Everything else is the same as it was when #1651 was resolved.
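
For reference, the attach_name calls go through the Legion C API (the same functions the patch below touches). A minimal sketch of the kind of calls involved; the handle variables and name strings here are hypothetical:

  legion_index_partition_attach_name(runtime, ip, "cells_ip", false /*is_mutable*/);
  legion_logical_partition_attach_name(runtime, lp, "cells_lp", false /*is_mutable*/);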

I am using master at fc1607c and have tried various combinations of flags:

  • -lg:safe_ctrlrepl 1 -lg:safe_mapper
  • -lg:partcheck
  • -lg:inorder

None of the checks return any errors, and -lg:inorder does not fix the hang.
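
For completeness, these are all passed as command-line flags to the application; one such combination on a hypothetical 4-node launch (launcher arguments and binary name are placeholders) would look something like:

  srun -N 4 -n 4 ./s3d.x -lg:safe_ctrlrepl 1 -lg:safe_mapper -lg:partcheck -lg:inorder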

Here are backtraces from a run with -lg:inorder -ll:force_kthreads:

For posterity, the logs (they are nearly empty):

What can I do to debug this?

There's enough information in the backtraces for me to debug this later.

I'm working around the issue with the following patch. So far I'm good up to 512 nodes and performance has been rock solid.

diff --git a/runtime/legion/legion_c.cc b/runtime/legion/legion_c.cc
index d72e68fb4..0e906534a 100644
--- a/runtime/legion/legion_c.cc
+++ b/runtime/legion/legion_c.cc
@@ -1998,7 +1998,7 @@ legion_index_partition_attach_name(legion_runtime_t runtime_,
   Runtime *runtime = CObjectWrapper::unwrap(runtime_);
   IndexPartition handle = CObjectWrapper::unwrap(handle_);

-  runtime->attach_name(handle, name, is_mutable);
+  // runtime->attach_name(handle, name, is_mutable);
 }

 void
@@ -2606,7 +2606,7 @@ legion_logical_partition_attach_name(legion_runtime_t runtime_,
   Runtime *runtime = CObjectWrapper::unwrap(runtime_);
   LogicalPartition handle = CObjectWrapper::unwrap(handle_);

-  runtime->attach_name(handle, name, is_mutable);
+  // runtime->attach_name(handle, name, is_mutable);
 }

 void

Avoiding attaches on logical partitions and on logical regions other than the top-level logical region should be sufficient to make things work. I understand the cause of the problem: attaches on those kinds of resources are simply broken in distributed settings right now.
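
If you want to keep names on the safe resources without patching the runtime, one option is to gate the attach in application code. A minimal C++ sketch, assuming you use Runtime::has_parent_logical_partition for the top-level check (the helper name is made up):

  #include "legion.h"
  using namespace Legion;

  // Only attach names to top-level logical regions; skip sub-regions (and
  // partitions entirely), which are the cases that currently hang in
  // distributed runs.
  static void attach_name_if_top_level(Runtime *runtime, Context ctx,
                                       LogicalRegion lr, const char *name)
  {
    // A top-level region has no parent logical partition.
    if (!runtime->has_parent_logical_partition(ctx, lr))
      runtime->attach_name(lr, name, false /*is_mutable*/);
  }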

Frontier is down today, but I confirmed it on up to 16 nodes of Perlmutter. Thanks!