open-mpi/ompi

MPI_THREAD_MULTIPLE support broken

ompiteam opened this issue · 10 comments

Support for MPI_THREAD_MULTIPLE is currently broken on trunk. Even very simple programs that just do an MPI_Init_thread with MPI_THREAD_MULTIPLE and then call MPI_Barrier() will hang.
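A minimal reproducer along the lines described (reconstructed from the description and the backtrace below; the original test2.c is not shown in this ticket, so names and line numbers here are illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;

    /* With preconnect enabled this already hangs inside
     * MPI_Init_thread (see the backtrace below); otherwise it
     * hangs in MPI_Barrier. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "warning: only got thread level %d\n", provided);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}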

Example of a backtrace when --mca mpi_preconnect_mpi 1 is passed to mpirun:

(gdb) bt
#0  0x0000008039720d6c in .pthread_cond_wait () from /lib64/power6/libpthread.so.0
#1  0x00000400001299d8 in opal_condition_wait (c=0x400004763f8, m=0x40000476460)
    at ../../ompi-trunk.chris2/opal/threads/condition.h:79
#2  0x000004000012a08c in ompi_request_default_wait_all (count=2, requests=0xfffffa9db20, 
    statuses=0x0) at ../../ompi-trunk.chris2/ompi/request/req_wait.c:281
#3  0x000004000012f56c in ompi_init_preconnect_mpi ()
    at ../../ompi-trunk.chris2/ompi/runtime/ompi_mpi_preconnect.c:72
#4  0x000004000012c738 in ompi_mpi_init (argc=1, argv=0xfffffa9f278, requested=3, 
    provided=0xfffffa9edd8) at ../../ompi-trunk.chris2/ompi/runtime/ompi_mpi_init.c:800
#5  0x000004000017a064 in PMPI_Init_thread (argc=0xfffffa9ee20, argv=0xfffffa9ee28, required=3, 
    provided=0xfffffa9edd8) at pinit_thread.c:84
#6  0x0000000010000ae4 in main (argc=1, argv=0xfffffa9f278) at test2.c:15

Running against the same build but without MPI_THREAD_MULTIPLE works fine.

I think the problem is due to changes between 1.6 and trunk in opal/threads/condition.h. From 1.6:

    if (opal_using_threads()) {
#if OPAL_HAVE_POSIX_THREADS && OPAL_ENABLE_PROGRESS_THREADS
        rc = pthread_cond_wait(&c->c_cond, &m->m_lock_pthread);
#elif OPAL_HAVE_SOLARIS_THREADS && OPAL_ENABLE_PROGRESS_THREADS
        rc = cond_wait(&c->c_cond, &m->m_lock_solaris);
#else
        if (c->c_signaled) {
            c->c_waiting--;
            opal_mutex_unlock(m);
            opal_progress();

and from trunk:

    if (opal_using_threads()) {
#if OPAL_HAVE_POSIX_THREADS && OPAL_ENABLE_MULTI_THREADS
        rc = pthread_cond_wait(&c->c_cond, &m->m_lock_pthread);
#elif OPAL_HAVE_SOLARIS_THREADS && OPAL_ENABLE_MULTI_THREADS
        rc = cond_wait(&c->c_cond, &m->m_lock_solaris);
#else
        if (c->c_signaled) {
            c->c_waiting--;
            opal_mutex_unlock(m);
            opal_progress();

Now, in 1.6 OPAL_ENABLE_PROGRESS_THREADS is hardcoded off by configure. So even with MPI threads enabled, when we are in ompi_request_default_wait_all and call opal_condition_wait, we fall through to the #else branch and still call opal_progress.
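
The #else branch shown truncated above continues by polling; roughly like this (a paraphrase of the 1.6 fallback path, not the verbatim source):

        /* No progress thread exists: poll for the signal ourselves,
         * driving opal_progress() on each iteration so the BTLs and
         * OOB keep moving while we wait. */
        while (c->c_signaled == 0) {
            opal_mutex_unlock(m);
            opal_progress();
            opal_mutex_lock(m);
        }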

In trunk, OPAL_ENABLE_MULTI_THREADS is set to 1 if MPI threads are enabled. Note that OPAL_ENABLE_MULTI_THREADS also exists in 1.6 and is likewise set to 1 when MPI threads are enabled, but, as can be seen above, it is not used to control how opal_condition_wait behaves.

So in trunk, when MPI_THREAD_MULTIPLE is requested in init, the pthread_cond_wait path is taken. MPI programs get stuck because the main thread sits in pthread_cond_wait and there appears to be no one around to call opal_progress. I've looked around in the OMPI code for where a thread should be spawned to service opal_progress, but I haven't been able to find it.
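
For illustration, the missing piece would be something along these lines. This is a purely hypothetical sketch (no such thread or function exists in the tree, which appears to be exactly the problem):

#include <pthread.h>
#include "opal/runtime/opal_progress.h"

/* Hypothetical: a dedicated thread that keeps opal_progress()
 * running so a main thread blocked in pthread_cond_wait can
 * eventually be signaled. */
static volatile int progress_thread_run = 1;

static void *progress_thread_fn(void *arg)
{
    (void)arg;
    while (progress_thread_run) {
        opal_progress();   /* drive registered BTL/OOB callbacks */
    }
    return NULL;
}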

Between 1.6 and trunk, OPAL_ENABLE_PROGRESS_THREADS seems to have disappeared and OMPI_ENABLE_PROGRESS_THREADS has appeared; the latter is hardcoded to be off. I tried to compile with OMPI_ENABLE_PROGRESS_THREADS set, but there are compile errors (presumably why it's turned off). So I'm wondering whether, in opal_condition_wait and a few other places, OPAL_ENABLE_MULTI_THREADS should in fact be OMPI_ENABLE_PROGRESS_THREADS.

If I change a few of those OPAL_ENABLE_MULTI_THREADS occurrences to OMPI_ENABLE_PROGRESS_THREADS (I don't know if I changed all that need to be changed), then I can start running threaded MPI programs again.
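
For concreteness, the guard in opal/threads/condition.h would end up looking something like this after that substitution (an illustrative sketch of the idea, not the actual change set):

    if (opal_using_threads()) {
#if OPAL_HAVE_POSIX_THREADS && OMPI_ENABLE_PROGRESS_THREADS
        /* block only when a progress thread exists to wake us */
        rc = pthread_cond_wait(&c->c_cond, &m->m_lock_pthread);
#elif OPAL_HAVE_SOLARIS_THREADS && OMPI_ENABLE_PROGRESS_THREADS
        rc = cond_wait(&c->c_cond, &m->m_lock_solaris);
#else
        /* fall back to the opal_progress() polling path, as in 1.6 */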

Imported from trac issue 3214. Created by cyeoh on 2012-08-08T21:00:22, last modified: 2013-06-11T21:17:00

Trac comment by rhc on 2012-08-16 00:16:23:

Replying to [comment:1 cyeoh]:

> After further investigation it turns out that with --enable-event-thread-support AND --enable-orte-progress-threads (which is marked experimental), the orte progress thread handles servicing the event thread, and as a result the OOB messages establishing the ib connection succeed.

That is correct - ORTE progress threads require libevent thread support, as was noted previously on the devel list.

> However the test program still hangs, as there is no thread to ensure progress for the btl openib. Normally btl_openib_component_progress is called as a callback by opal_progress, but with the main thread calling opal_condition_wait instead, this never occurs.
>
> Presumably the currently unfinished OMPI_ENABLE_PROGRESS_THREADS is eventually meant to fix this. However, in the meantime my suggestion (I'll attach a patch) is to reintroduce OPAL_ENABLE_PROGRESS_THREADS for opal/threads/condition.h and use it where OPAL_ENABLE_MULTI_THREADS currently selects between using pthread_cond_wait and calling opal_progress in a loop, watching for the signal variable in the condition object to change.

This doesn't sound right to me - the opal_enable_multi_threads flag is something quite different from enabling OMPI-level progress threads. Why not just fix the openib code itself? In other words, if ompi_enable_progress_threads is set, then have the openib code do the right thing?

We already agreed at the last developer meeting that we shouldn't be forcibly turning off ompi_enable_progress_threads - that was done solely because we were failing all the threaded tests and nobody was doing anything about it. If the code is finally reaching the point of working, or at least people are willing to work on it, then by all means remove that line in configure.ac.

> This is strictly not needed for the orte components when --enable-orte-progress-threads is enabled, but there is no clean way to enable it just for orte and not for ompi, as it is a compile-time property.
>
> Any thoughts?

Trac comment by cyeoh on 2012-08-15 09:46:08:

After further investigation it turns out that with --enable-event-thread-support AND --enable-orte-progress-threads (which is marked experimental), the orte progress thread handles servicing the event thread, and as a result the OOB messages establishing the ib connection succeed.

However the test program still hangs, as there is no thread to ensure progress for the btl openib. Normally btl_openib_component_progress is called as a callback by opal_progress, but with the main thread calling opal_condition_wait instead, this never occurs.
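
For context, components hook into opal_progress by registering a callback; roughly like this (a sketch; the actual call site and error handling in the openib component are omitted):

#include "opal/runtime/opal_progress.h"

/* During component init, openib registers its progress function so
 * that each opal_progress() call polls the BTL: */
opal_progress_register(btl_openib_component_progress);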

Presumably the currently unfinished OMPI_ENABLE_PROGRESS_THREADS is eventually meant to fix this. However, in the meantime my suggestion (I'll attach a patch) is to reintroduce OPAL_ENABLE_PROGRESS_THREADS for opal/threads/condition.h and use it where OPAL_ENABLE_MULTI_THREADS currently selects between using pthread_cond_wait and calling opal_progress in a loop, watching for the signal variable in the condition object to change.

This is strictly not needed for the orte components when --enable-orte-progress-threads is enabled, but there is no clean way to enable it just for orte and not for ompi, as it is a compile-time property.

Any thoughts?

Trac comment by cyeoh on 2012-08-16 01:03:36:

> After further investigation it turns out that with --enable-event-thread-support AND --enable-orte-progress-threads (which is marked experimental), the orte progress thread handles servicing the event thread, and as a result the OOB messages establishing the ib connection succeed.

> That is correct - ORTE progress threads require libevent thread support, as was noted previously on the devel list.

If --enable-mpi-thread-multiple builds aren't going to work with MPI_THREAD_MULTIPLE (and they currently don't) unless event thread support and the orte progress thread are also enabled, then shouldn't configure automatically enable those too when --enable-mpi-thread-multiple is requested?

> This doesn't sound right to me - the opal_enable_multi_threads flag is something quite different from enabling OMPI-level progress threads. Why not just fix the openib code itself? In other words, if ompi_enable_progress_threads is set, then have the openib code do the right thing?

I think what is making things a bit difficult between 1.6 and trunk is that the opal thread implementation is automatically turned on when MPI thread multiple support is enabled. In 1.6, OPAL_ENABLE_MULTI_THREADS was turned on with MPI thread multiple support, but it didn't actually control anything - you had to turn on OPAL_ENABLE_PROGRESS_THREADS, which was off by default.

So in 1.6 we could run with MPI_THREAD_MULTIPLE even though OMPI_ENABLE_PROGRESS_THREADS (which is incomplete) was disabled.

In trunk we now have to complete OMPI_ENABLE_PROGRESS_THREADS support (which doesn't currently even compile) before any threaded MPI program will work, because otherwise there is nothing to progress the openib btl. Previously, because the low-level opal thread support was also disabled, there was no assumption of a separate progress thread for the btls, and rather than call pthread_cond_wait, opal_condition_wait would call opal_progress.

So I'm wondering if the current state is intentional - e.g., forcing the situation so that someone completes OMPI_ENABLE_PROGRESS_THREADS support if they want to run with MPI_THREAD_MULTIPLE?

Trac comment by rhc on 2012-08-16 15:05:09:

Okay, to make things easier, I added a couple of checks to configure so that we error out if orte progress threads are enabled without libevent thread support, and if ompi progress threads are enabled without orte progress threads.

I also checked and found that only one place didn't compile when OMPI progress threads are enabled: a variable ompi_progress_thread_count in ompi/request/req_wait.c that wasn't defined. It is only used in that file, so I statically defined it there and initialized it to zero. I have no idea if that's right, or if it will make things work - but at least everything now compiles.
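
Presumably something like the following near the top of ompi/request/req_wait.c (the exact type and qualifiers are an assumption; the ticket only says it was statically defined and zero-initialized):

/* Only used in this file; zero until a progress thread is spawned. */
static int ompi_progress_thread_count = 0;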

We definitely want to at least -encourage- people to complete the progress thread code. We need async progress for MPI-3, which is a goal for the 1.7 series. So putting some pressure on would be good.

Hope that helps the debugging!

Trac comment by rolfv on 2012-12-17 12:41:14:

What is the status of this ticket?

Hello, what is the status of this bug? Debian has now switched all of its architectures to Open MPI by default, but the fact that Open MPI has broken MPI_THREAD_MULTIPLE probably means it should switch back to MPICH?

The current version of Open MPI on Debian is 1.10; do we have to assume that this bug is still open there?

I hit the problem with the 1.8.x versions, and it seems to be solved in the latest 1.10.3.

v2.0.0 will fix this.

rhc54 commented

As noted, it is also solved in 1.10.3, so I'd consider this fixed.