Realm: set_bit id assertion on Summit
syamajala opened this issue · 5 comments
syamajala commented
I'm seeing the following assertion with S3D on Summit:
s3d.x: /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/legion/runtime/realm/nodeset.cc:327: size_t Realm::NodeSetBitmask::set_bit(Realm::NodeID): Assertion `(id >= 0) && (id <= max_node_id)' failed.
Here is a stack trace:
[1] Thread 1 (Thread 0x20000a3b9890 (LWP 177674)):
[1] #0 0x0000200000a09ca0 in waitpid () from /lib64/power9/libc.so.6
[1] #1 0x0000200000974340 in do_system () from /lib64/power9/libc.so.6
[1] #2 0x00002000008e8ec8 in system_compat () from /lib64/power9/libpthread.so.0
[1] #3 0x0000200008237320 in gasneti_system_redirected () from /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/legion/language/build/lib/librealm.so.1
[1] #4 0x0000200008237a8c in gasneti_bt_gdb () from /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/legion/language/build/lib/librealm.so.1
[1] #5 0x000020000823caf4 in gasneti_print_backtrace () from /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/legion/language/build/lib/librealm.so.1
[1] #6 0x000020000823d09c in _gasneti_print_backtrace_ifenabled () from /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/legion/language/build/lib/librealm.so.1
[1] #7 0x00002000071a63a0 in gasneti_defaultSignalHandler () from /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/legion/language/build/lib/librealm.so.1
[1] #8 <signal handler called>
[1] #9 0x0000200000963618 in raise () from /lib64/power9/libc.so.6
[1] #10 0x0000200000943a2c in abort () from /lib64/power9/libc.so.6
[1] #11 0x0000200000956f70 in __assert_fail_base () from /lib64/power9/libc.so.6
[1] #12 0x0000200000957014 in __assert_fail () from /lib64/power9/libc.so.6
[1] #13 0x00002000077d7a58 in Realm::NodeSetBitmask::set_bit (this=0x20463a58, id=0) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/legion/runtime/realm/nodeset.cc:327
[1] #14 0x00002000077d7460 in Realm::NodeSet::convert_to_bitmask (this=0x200008a4d448 <Realm::Network::shared_peers>) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/legion/runtime/realm/nodeset.cc:225
[1] #15 0x00002000071c8c78 in Realm::NodeSet::add (this=0x200008a4d448 <Realm::Network::shared_peers>, id=5) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/legion/runtime/realm/nodeset.inl:154
[1] #16 0x00002000079b0910 in Realm::GASNetEXInternal::init (this=0x2006b4f0, argc=0x7fffce7478c0, argv=0x7fffce7478c8) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/legion/runtime/realm/gasnetex/gasnetex_internal.cc:3192
[1] #17 0x00002000079a027c in Realm::GASNetEXModule::create_network_module (runtime=0x2006a0f0, argc=0x7fffce7478c0, argv=0x7fffce7478c8) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/legion/runtime/realm/gasnetex/gasnetex_module.cc:470
[1] #18 0x00002000077d4fd0 in Realm::ModuleRegistrar::NetworkRegistration<Realm::GASNetEXModule>::create_network_module (this=0x200008a4b590 <registration_93>, runtime=0x2006a0f0, argc=0x7fffce7478c0, argv=0x7fffce7478c8) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/legion/runtime/realm/module.h:183
[1] #19 0x00002000079d7d24 in Realm::ModuleRegistrar::create_network_modules (this=0x2006b408, modules=..., argc=0x7fffce7478c0, argv=0x7fffce7478c8) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/legion/runtime/realm/network.cc:709
[1] #20 0x0000200007846934 in Realm::RuntimeImpl::network_init (this=0x2006a0f0, argc=0x7fffce749a10, argv=0x7fffce749a18) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/legion/runtime/realm/runtime_impl.cc:1294
[1] #21 0x0000200007840518 in Realm::Runtime::network_init (this=0x7fffce748670, argc=0x7fffce749a10, argv=0x7fffce749a18) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/legion/runtime/realm/runtime_impl.cc:453
[1] #22 0x000020000535e54c in Legion::Internal::Runtime::initialize (argc=0x7fffce749a10, argv=0x7fffce749a18, parse=true, filter=false) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/legion/runtime/legion/runtime.cc:30051
[1] #23 0x000020000535dbbc in Legion::Internal::Runtime::start (argc=34, argv=0x1f397600, background=true, supply_default_mapper=true, filter=false) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/legion/runtime/legion/runtime.cc:29920
[1] #24 0x0000200004c3b7a0 in Legion::Runtime::start (argc=34, argv=0x1f397600, background=true, default_mapper=true, filter=false) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/legion/runtime/legion/legion.cc:7680
[1] #25 0x0000200000116b54 in S3DRank::start_legion() () from /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/Ammonia_Cases/pwave_x_1_hept/librhsf.so
[1] #26 0x0000200000111c40 in initialize_rhsf_legion_ () from /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/Ammonia_Cases/pwave_x_1_hept/librhsf.so
[1] #27 0x00000000101bba40 in solve_driver (io=6) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/s3d/source/drivers/solve_driver.f90:194
[1] #28 0x00000000101bb1c8 in s3d () at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/s3d/source/drivers/main.f90:131
[1] #29 0x0000000010003994 in main (argc=<optimized out>, argv=<optimized out>) at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc/s3d/source/drivers/main.f90:8
[1] #30 0x0000200000944078 in generic_start_main.isra () from /lib64/power9/libc.so.6
[1] #31 0x0000200000944264 in __libc_start_main () from /lib64/power9/libc.so.6
[1] #32 0x0000000000000000 in ?? ()
@eddy16112 gave me a workaround for now, which is to comment out these lines: https://gitlab.com/StanfordLegion/legion/-/blob/d5660934f40f5f1f9c5dd1aeee30016a4e4065d8/runtime/realm/gasnetex/gasnetex_internal.cc#L3189-3194
lightsighter commented
@muraj is already backing out the change that caused this.
muraj commented
This is actually an issue with the shared_peers NodeSet, caused by the nodesetbitmask::allocator being initialized after its first use. I have a change to correct that pending review.
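To illustrate the class of bug described here, below is a minimal sketch of a use-before-initialization failure. It is not Realm's code: `ToyBitmask`, `configure`, `configured`, and `test` are hypothetical names, and the idea that the limit starts at -1 before setup is an assumption, chosen only because it would be consistent with frame #13 above, where `set_bit` fails the range check even with `id=0`. Where the real code asserts, the sketch returns `false` so the behavior can be checked without aborting.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a bitmask whose valid ID range is set by a
// one-time global configuration step. Not Realm's NodeSetBitmask.
class ToyBitmask {
public:
  // One-time setup; in this sketch it plays the role of the allocator
  // initialization that has to happen before any bitmask is used.
  static void configure(int max_id) { max_node_id = max_id; }

  // True once configure() has been called with a non-negative limit.
  static bool configured() { return max_node_id >= 0; }

  // Same range check as the failing assertion, but reported as a return
  // value: false means "the real code would have asserted here".
  bool set_bit(int id) {
    if (!((id >= 0) && (id <= max_node_id))) return false;
    const std::size_t needed = static_cast<std::size_t>(max_node_id) + 1;
    if (bits.size() < needed) bits.resize(needed, false);
    bits[static_cast<std::size_t>(id)] = true;
    return true;
  }

  // Reports whether a bit is set; out-of-range IDs are simply unset.
  bool test(int id) const {
    return id >= 0 && static_cast<std::size_t>(id) < bits.size() &&
           bits[static_cast<std::size_t>(id)];
  }

private:
  static int max_node_id;   // assumed to be -1 until configure() runs
  std::vector<bool> bits;
};

// Unconfigured state: every ID, including 0, is out of range.
int ToyBitmask::max_node_id = -1;
```

With this sketch, calling `set_bit(0)` before `configure(...)` fails the range check, while the same call after `configure(7)` succeeds. That is the ordering problem in miniature: the fix is to make sure setup runs before the first user, not to change the check.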