StanfordLegion/legion

Freeze in S3D on attach_name


I am seeing a freeze in S3D on attach_name at 4 nodes. This is running a very slightly modified version of S3D relative to #1651: the only difference is that I added index launches to some of the tasks that create partitions. Note that I have touched nothing that should directly affect an attach_name call, only (perhaps) the order of such calls. Everything else is the same as it was when #1651 was resolved.
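
For reference, the attach_name calls go through the Legion C API (the same functions the patch below touches). A minimal sketch of the kind of calls involved; the handle variables and name strings here are hypothetical:

  legion_index_partition_attach_name(runtime, ip, "cells_ip", false /*is_mutable*/);
  legion_logical_partition_attach_name(runtime, lp, "cells_lp", false /*is_mutable*/);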

I am using master at fc1607c and have tried various combinations of flags:

  • -lg:safe_ctrlrepl 1 -lg:safe_mapper
  • -lg:partcheck
  • -lg:inorder

None of the checks return any errors, and -lg:inorder does not fix the hang.
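
For completeness, these are all passed as command-line flags to the application; one such combination on a hypothetical 4-node launch (launcher arguments and binary name are placeholders) would look something like:

  srun -N 4 -n 4 ./s3d.x -lg:safe_ctrlrepl 1 -lg:safe_mapper -lg:partcheck -lg:inorder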

Here are backtraces from a run with -lg:inorder -ll:force_kthreads:

For posterity, the logs (they are nearly empty):

What can I do to debug this?

There's enough information in the backtraces for me to debug this later.

I'm working around the issue with the following patch. So far I'm good up to 512 nodes and performance has been rock solid.

diff --git a/runtime/legion/legion_c.cc b/runtime/legion/legion_c.cc
index d72e68fb4..0e906534a 100644
--- a/runtime/legion/legion_c.cc
+++ b/runtime/legion/legion_c.cc
@@ -1998,7 +1998,7 @@ legion_index_partition_attach_name(legion_runtime_t runtime_,
   Runtime *runtime = CObjectWrapper::unwrap(runtime_);
   IndexPartition handle = CObjectWrapper::unwrap(handle_);

-  runtime->attach_name(handle, name, is_mutable);
+  // runtime->attach_name(handle, name, is_mutable);
 }

 void
@@ -2606,7 +2606,7 @@ legion_logical_partition_attach_name(legion_runtime_t runtime_,
   Runtime *runtime = CObjectWrapper::unwrap(runtime_);
   LogicalPartition handle = CObjectWrapper::unwrap(handle_);

-  runtime->attach_name(handle, name, is_mutable);
+  // runtime->attach_name(handle, name, is_mutable);
 }

 void

Avoiding attaches on logical partitions and on logical regions other than the top-level logical region should be sufficient to make things work. I understand the cause of the problem: attaches on those kinds of resources are simply broken in distributed settings right now.
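
If you want to keep names on the safe resources without patching the runtime, one option is to gate the attach in application code. A minimal C++ sketch, assuming you use Runtime::has_parent_logical_partition for the top-level check (the helper name is made up):

  #include "legion.h"
  using namespace Legion;

  // Only attach names to top-level logical regions; skip sub-regions (and
  // partitions entirely), which are the cases that currently hang in
  // distributed runs.
  static void attach_name_if_top_level(Runtime *runtime, Context ctx,
                                       LogicalRegion lr, const char *name)
  {
    // A top-level region has no parent logical partition.
    if (!runtime->has_parent_logical_partition(ctx, lr))
      runtime->attach_name(lr, name, false /*is_mutable*/);
  }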

Frontier is down today, but I confirmed it on up to 16 nodes of Perlmutter. Thanks!