STEllAR-GROUP/hpx

HPX hangs at the very end

JiakunYan opened this issue · 5 comments

Expected Behavior

Octo-Tiger exits successfully.

Actual Behavior

Sometimes HPX hangs after hpx::finalize() has been called.

This happens occasionally when I run Octo-Tiger on Perlmutter. It does not block my experiments, but it wastes compute credits because the hung SLURM jobs run until they hit their time limit.

I happened to get the full stack trace today when this occurred again. I was running Octo-Tiger with 8 nodes and 128 localities per node (just to regenerate the dataset), so every HPX process only had 11 threads in total.

@hkaiser Could you take a look and see whether you can spot something obviously wrong from the stack trace?

info threads
(gdb) info threads
  Id   Target Id                                       Frame 
* 1    Thread 0x7f2a7d9f5000 (LWP 2306119) "octotiger" 0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2    Thread 0x7f2a59636000 (LWP 2308582) "octotiger" 0x00007f2aa98f0999 in poll ()
   from /lib64/libc.so.6
  3    Thread 0x7f2a57fff000 (LWP 2308760) "octotiger" 0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  4    Thread 0x7f2a577fe000 (LWP 2308761) "octotiger" 0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  5    Thread 0x7f2a56ffd000 (LWP 2309162) "octotiger" 0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  6    Thread 0x7f2a567fc000 (LWP 2309163) "octotiger" 0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  7    Thread 0x7f2a55ffb000 (LWP 2309164) "octotiger" 0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  8    Thread 0x7f2a557fa000 (LWP 2309165) "octotiger" 0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  9    Thread 0x7f2a54ff9000 (LWP 2309166) "octotiger" 0x00007f2aa99eda5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  10   Thread 0x7f2a547f8000 (LWP 2309167) "octotiger" 0x00007f2aa99eda5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  11   Thread 0x7f2a539f7000 (LWP 2309809) "octotiger" 0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
Thread 1 backtrace
(gdb) bt
#0  0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f2ab2643afc in asio::detail::scheduler::run(std::error_code&) [clone .isra.0] ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#2  0x00007f2ab2644128 in hpx::util::io_service_pool::thread_run(unsigned long, hpx::util::barrier*) const ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#3  0x00007f2ab2dc12bd in hpx::runtime_distributed::wait() ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx.so.1
#4  0x00007f2ab2db5a03 in hpx::runtime_distributed::run() ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx.so.1
#5  0x00007f2ab2c8c1ea in hpx::detail::run_or_start(bool, std::unique_ptr<hpx::runtime, std::default_delete<hpx::runtime> >, hpx::util::command_line_handling&, hpx::move_only_function<void (), false>, hpx::move_only_function<void (), false>) ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx.so.1
#6  0x00007f2ab2c8c9da in hpx::detail::run_or_start(hpx::function<int (hpx::program_options::variables_map&), false> const&, int, char**, hpx::init_params const&, bool) ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx.so.1
#7  0x0000000000431f89 in main ()
Thread 2 backtrace
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f2a59636000 (LWP 2308582))]
#0  0x00007f2aa98f0999 in poll () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f2aa98f0999 in poll () from /lib64/libc.so.6
#1  0x00007f2aa6c4389a in poll (__timeout=-1, __nfds=1, __fds=0x7f2a596257b8)
    at /usr/include/bits/poll2.h:46
#2  ofi_uffd_handler (arg=<optimized out>) at prov/util/src/util_mem_monitor.c:515
#3  0x00007f2aa99e66ea in start_thread () from /lib64/libpthread.so.0
#4  0x00007f2aa98fd49f in clone () from /lib64/libc.so.6
Thread 3 backtrace
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f2a57fff000 (LWP 2308760))]
#0  0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f2ab2643afc in asio::detail::scheduler::run(std::error_code&) [clone .isra.0] ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#2  0x00007f2ab2644128 in hpx::util::io_service_pool::thread_run(unsigned long, hpx::util::barrier*) const ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#3  0x00007f2aa9edcac3 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x00007f2aa99e66ea in start_thread () from /lib64/libpthread.so.0
#5  0x00007f2aa98fd49f in clone () from /lib64/libc.so.6
Thread 4 backtrace
(gdb) thread 4
[Switching to thread 4 (Thread 0x7f2a577fe000 (LWP 2308761))]
#0  0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f2ab2643afc in asio::detail::scheduler::run(std::error_code&) [clone .isra.0] ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#2  0x00007f2ab2644128 in hpx::util::io_service_pool::thread_run(unsigned long, hpx::util::barrier*) const ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#3  0x00007f2aa9edcac3 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x00007f2aa99e66ea in start_thread () from /lib64/libpthread.so.0
#5  0x00007f2aa98fd49f in clone () from /lib64/libc.so.6
Thread 5 backtrace
(gdb) thread 5
[Switching to thread 5 (Thread 0x7f2a56ffd000 (LWP 2309162))]
#0  0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f2ab2643afc in asio::detail::scheduler::run(std::error_code&) [clone .isra.0] ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#2  0x00007f2ab2644128 in hpx::util::io_service_pool::thread_run(unsigned long, hpx::util::barrier*) const ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#3  0x00007f2aa9edcac3 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x00007f2aa99e66ea in start_thread () from /lib64/libpthread.so.0
#5  0x00007f2aa98fd49f in clone () from /lib64/libc.so.6
Thread 6 backtrace
(gdb) thread 6
[Switching to thread 6 (Thread 0x7f2a567fc000 (LWP 2309163))]
#0  0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f2ab2643afc in asio::detail::scheduler::run(std::error_code&) [clone .isra.0] ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#2  0x00007f2ab2644128 in hpx::util::io_service_pool::thread_run(unsigned long, hpx::util::barrier*) const ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#3  0x00007f2aa9edcac3 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x00007f2aa99e66ea in start_thread () from /lib64/libpthread.so.0
#5  0x00007f2aa98fd49f in clone () from /lib64/libc.so.6
Thread 7 backtrace
(gdb) thread 7
[Switching to thread 7 (Thread 0x7f2a55ffb000 (LWP 2309164))]
#0  0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f2ab2643afc in asio::detail::scheduler::run(std::error_code&) [clone .isra.0] ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#2  0x00007f2ab2644128 in hpx::util::io_service_pool::thread_run(unsigned long, hpx::util::barrier*) const ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#3  0x00007f2aa9edcac3 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x00007f2aa99e66ea in start_thread () from /lib64/libpthread.so.0
#5  0x00007f2aa98fd49f in clone () from /lib64/libc.so.6
Thread 8 backtrace
(gdb) thread 8
[Switching to thread 8 (Thread 0x7f2a557fa000 (LWP 2309165))]
#0  0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f2ab2643afc in asio::detail::scheduler::run(std::error_code&) [clone .isra.0] ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#2  0x00007f2ab2644128 in hpx::util::io_service_pool::thread_run(unsigned long, hpx::util::barrier*) const ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#3  0x00007f2aa9edcac3 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x00007f2aa99e66ea in start_thread () from /lib64/libpthread.so.0
#5  0x00007f2aa98fd49f in clone () from /lib64/libc.so.6
Thread 9 backtrace
(gdb) thread 9
[Switching to thread 9 (Thread 0x7f2a54ff9000 (LWP 2309166))]
#0  0x00007f2aa99eda5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f2aa99eda5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f2ab275ae25 in hpx::threads::policies::scheduler_base::idle_callback(unsigned long)
    ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#2  0x00007f2ab2742c1c in void hpx::threads::detail::scheduling_loop<hpx::threads::policies::local_priority_queue_scheduler<std::mutex, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_lifo> >(unsigned long, hpx::threads::policies::local_priority_queue_scheduler<std::mutex, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_lifo>&, hpx::threads::detail::scheduling_counters&, hpx::threads::detail::scheduling_callbacks&) ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#3  0x00007f2ab2743584 in hpx::threads::detail::scheduled_thread_pool<hpx::threads::policies::local_priority_queue_scheduler<std::mutex, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_lifo> >::thread_func(unsigned long, unsigned long, std::shared_ptr<hpx::util::barrier>) ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#4  0x00007f2ab26e8f63 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (hpx::threads::detail::scheduled_thread_pool<hpx::threads::policies::local_priority_queue_scheduler<std::mutex, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_lifo> >::*)(unsigned long, unsigned long, std::shared_ptr<hpx::util::barrier>), hpx::threads::detail::scheduled_thread_pool<hpx::threads::policies::local_priority_queue_scheduler<std::mutex, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_lifo> >*, unsigned long, unsigned long, std::shared_ptr<hpx::util::barrier> > > >::_M_run() ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#5  0x00007f2aa9edcac3 in ?? () from /usr/lib64/libstdc++.so.6
#6  0x00007f2aa99e66ea in start_thread () from /lib64/libpthread.so.0
#7  0x00007f2aa98fd49f in clone () from /lib64/libc.so.6
Thread 10 backtrace
(gdb) thread 10
[Switching to thread 10 (Thread 0x7f2a547f8000 (LWP 2309167))]
#0  0x00007f2aa99eda5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f2aa99eda5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f2ab275ae25 in hpx::threads::policies::scheduler_base::idle_callback(unsigned long)
    ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#2  0x00007f2ab2742c1c in void hpx::threads::detail::scheduling_loop<hpx::threads::policies::local_priority_queue_scheduler<std::mutex, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_lifo> >(unsigned long, hpx::threads::policies::local_priority_queue_scheduler<std::mutex, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_lifo>&, hpx::threads::detail::scheduling_counters&, hpx::threads::detail::scheduling_callbacks&) ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#3  0x00007f2ab2743584 in hpx::threads::detail::scheduled_thread_pool<hpx::threads::policies::local_priority_queue_scheduler<std::mutex, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_lifo> >::thread_func(unsigned long, unsigned long, std::shared_ptr<hpx::util::barrier>) ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#4  0x00007f2ab26e8f63 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (hpx::threads::detail::scheduled_thread_pool<hpx::threads::policies::local_priority_queue_scheduler<std::mutex, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_lifo> >::*)(unsigned long, unsigned long, std::shared_ptr<hpx::util::barrier>), hpx::threads::detail::scheduled_thread_pool<hpx::threads::policies::local_priority_queue_scheduler<std::mutex, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_lifo> >*, unsigned long, unsigned long, std::shared_ptr<hpx::util::barrier> > > >::_M_run() ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx_core.so
#5  0x00007f2aa9edcac3 in ?? () from /usr/lib64/libstdc++.so.6
#6  0x00007f2aa99e66ea in start_thread () from /lib64/libpthread.so.0
#7  0x00007f2aa98fd49f in clone () from /lib64/libc.so.6
Thread 11 backtrace
(gdb) thread 11
[Switching to thread 11 (Thread 0x7f2a539f7000 (LWP 2309809))]
#0  0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f2aa99ed70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f2ab2dc0db2 in hpx::components::server::runtime_support::wait() ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx.so.1
#2  0x00007f2ab2dc1016 in hpx::runtime_distributed::wait_helper(std::mutex&, std::condition_variable&, bool&) ()
   from /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-12.3.0/hpx-master-plvhffpq7tdpmj7pbc2stxeb5zoxoztd/lib64/libhpx.so.1
#3  0x00007f2aa9edcac3 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x00007f2aa99e66ea in start_thread () from /lib64/libpthread.so.0
#5  0x00007f2aa98fd49f in clone () from /lib64/libc.so.6

Specifications

  • HPX Version: The current master branch
  • Platform (compiler, OS): Perlmutter (GCC 12.3.0 on SLES 15, per the Spack install paths in the backtraces)

Nothing unusual. Two HPX scheduler threads still seem to be active, but neither seems to have any work left (both sit in idle_callback).
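
For readers following the trace: threads 9 and 10 show a pthread_cond_timedwait frame directly under idle_callback, which suggests the idle workers are parked in a timed wait rather than spinning. Below is a minimal sketch of that general pattern using only the C++ standard library. It is not HPX's implementation: `IdleBackoff`, `idle_wait`, `notify_work`, and the exponential back-off policy are all hypothetical names and assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical idle helper (an illustration, not an HPX class).
class IdleBackoff {
public:
    explicit IdleBackoff(std::chrono::milliseconds max_wait)
      : max_wait_(max_wait) {
        assert(max_wait.count() > 0);
    }

    // Park the calling worker for 1, 2, 4, ... ms, capped at max_wait_.
    // Returns true if notify_work() was called, false on timeout.
    bool idle_wait() {
        std::unique_lock<std::mutex> lock(mtx_);
        std::chrono::milliseconds const period = std::min(max_wait_,
            std::chrono::milliseconds(
                1LL << std::min<std::size_t>(idle_count_, 20)));
        ++idle_count_;
        // On Linux this call typically ends up in pthread_cond_timedwait.
        bool const woken = cond_.wait_for(
            lock, period, [this] { return work_available_; });
        if (woken) {
            idle_count_ = 0;          // reset the back-off after real work
            work_available_ = false;
        }
        return woken;
    }

    // Called by whoever produces work for the parked worker.
    void notify_work() {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            work_available_ = true;
        }
        cond_.notify_one();
    }

    std::size_t idle_count() const {
        std::lock_guard<std::mutex> lock(mtx_);
        return idle_count_;
    }

private:
    mutable std::mutex mtx_;
    std::condition_variable cond_;
    std::chrono::milliseconds max_wait_;
    std::size_t idle_count_ = 0;
    bool work_available_ = false;
};
```

In this sketch a parked worker does nothing, including any polling, until the wait times out or `notify_work()` is called, so how often background work runs depends on the wait period and on who sends the notification.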

So why is the network background function not invoked?

Are you sure there are messages pending or potentially incoming?

I am not sure whether there are pending or incoming messages, but the process cannot be sure either until the Dijkstra termination detection algorithm has completed. The network background function is how a process checks for incoming messages and makes progress on pending ones. My understanding is that idle worker threads keep calling the network background function until Dijkstra termination detection concludes.
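
To illustrate why a locality cannot decide on its own that nothing more will arrive, here is a small single-process model of token-ring termination detection. It assumes that "Dijkstra termination" refers to the ring algorithm by Dijkstra, Feijen, and van Gasteren, and that messages are delivered instantly. `Ring`, `Locality`, `Outcome`, and `late_message_scenario` are hypothetical names, and none of this is HPX code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Color { white, black };

struct Locality {
    bool active = false;         // true while the locality still has work
    Color color = Color::white;  // black once it has sent a message
};

// Hypothetical model of a ring of localities (illustration only).
class Ring {
public:
    explicit Ring(std::size_t n) : nodes_(n) { assert(n >= 2); }

    void start_work(std::size_t i) { nodes_[i].active = true; }
    void finish_work(std::size_t i) { nodes_[i].active = false; }

    // Instant delivery: the sender turns black, the receiver becomes active.
    void send(std::size_t from, std::size_t to) {
        nodes_[from].color = Color::black;
        nodes_[to].active = true;
    }

    // Move the token one hop if its holder is passive.
    // Returns true only when locality 0 may conclude termination.
    bool advance() {
        Locality& holder = nodes_[holder_];
        if (holder.active) return false;   // the token waits for a busy holder
        if (!probing_) {                   // locality 0 starts a new probe
            holder.color = Color::white;
            token_ = Color::white;
            holder_ = nodes_.size() - 1;
            probing_ = true;
            return false;
        }
        if (holder_ == 0) {                // the probe is back at the initiator
            probing_ = false;
            return token_ == Color::white && holder.color == Color::white;
        }
        if (holder.color == Color::black) token_ = Color::black;
        holder.color = Color::white;       // the holder is clean again
        --holder_;
        return false;
    }

private:
    std::vector<Locality> nodes_;
    std::size_t holder_ = 0;
    Color token_ = Color::white;
    bool probing_ = false;
};

struct Outcome { bool first_probe; bool second_probe; };

// Locality 1 reactivates locality 2 after the token has already passed it.
Outcome late_message_scenario() {
    Ring ring(3);
    ring.start_work(1);
    ring.advance();                    // probe 1 starts, token goes to 2
    ring.advance();                    // 2 is passive and white, token goes to 1
    ring.advance();                    // 1 is busy, the token waits
    ring.send(1, 2);                   // 2 becomes active behind the token
    ring.finish_work(1);
    ring.advance();                    // 1 was black, token turns black, goes to 0
    bool const first = ring.advance(); // black token at 0: probe fails
    ring.finish_work(2);
    ring.advance();                    // probe 2 starts
    ring.advance();
    ring.advance();
    bool const second = ring.advance();
    return {first, second};
}
```

When the first probe returns to locality 0, localities 0 and 1 are both idle, yet locality 2 is still working, and only the black token reveals this. In this model a locality has to keep servicing the token and incoming messages until a clean probe completes. With a real network, messages can also be in flight, which this sketch ignores.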

Closing for now until I get more useful data.