PMI1 shim hangs with dstore gds component, passes with hash
jjhursey opened this issue · 5 comments
Background information
What version of the PMIx Reference Library are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)
- Open MPI v4.0.x (at open-mpi/ompi@cb3ed47) and PMIx 3.1.4
Describe how PMIx was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
PMIx built with:
./configure --prefix=${PMIX_ROOT} \
--with-hwloc=${HWLOC_INSTALL_PATH} \
--with-libevent=${LIBEVENT_INSTALL_PATH} \
--enable-debug
Open MPI built with:
./configure --prefix=${MPI_ROOT} \
--with-hwloc=${HWLOC_INSTALL_PATH} \
--with-libevent=${LIBEVENT_INSTALL_PATH} \
--with-pmix=${PMIX_ROOT} \
--enable-mpirun-prefix-by-default
Please describe the system on which you are running
- Operating system/version: RHEL 7.6
- Computer hardware: ppc64le
- Network type: localhost
Details of the problem
I was testing the PMI1 shim with the pmi_client.c in the v3.1.4 release.
[mpiuser@3cd5ac47d23f test]$ env | grep PMIX
PMIX_ROOT=/home/mpiuser/local/pmix
[mpiuser@3cd5ac47d23f test]$ mpicc pmi_client.c -I${PMIX_ROOT}/include -L${PMIX_ROOT}/lib -lpmi -Wall -g -O0 -o pmi_client
[mpiuser@3cd5ac47d23f test]$ mpirun -np 2 ./pmi_client
0:INFO: spawned=0 size=2 rank=0 appnum=0
0:FATAL: 0: at test_item1:211
0:INFO: PMI_Get_id_length_max=255
0:INFO: jobid=2412576769
1:INFO: spawned=0 size=2 rank=1 appnum=0
1:FATAL: 1: at test_item1:211
1:INFO: PMI_Get_id_length_max=255
1:INFO: jobid=2412576769
1:INFO: PMI_Get_kvs_domain_id=2412576769
1:INFO: PMI_KVS_Get_my_name=2412576769
1:INFO: TI1 : PASS
1:INFO: TI2 : PASS
1:INFO: PMI_KVS_Get_key_length_max=511
1:INFO: PMI_KVS_Get_value_length_max=4096
1:INFO: TI3 : PASS
1:INFO: PMI_Get_clique_size=2
0:INFO: PMI_Get_kvs_domain_id=2412576769
0:INFO: PMI_KVS_Get_my_name=2412576769
0:INFO: TI1 : PASS
0:INFO: TI2 : PASS
0:INFO: PMI_KVS_Get_key_length_max=511
0:INFO: PMI_KVS_Get_value_length_max=4096
0:INFO: TI3 : PASS
0:INFO: PMI_Get_clique_size=2
1:INFO: TI4 : PASS
1:ERROR: PMIx and SLURM/PMI1 do not set 'PMI_process_mapping' (Do not mark test as failed)
1:INFO: TI5 : PASS
0:INFO: TI4 : PASS
0:ERROR: PMIx and SLURM/PMI1 do not set 'PMI_process_mapping' (Do not mark test as failed)
0:INFO: TI5 : PASS
<<<---- Hangs here
[mpiuser@3cd5ac47d23f ~]$ ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
mpiuser 1 0.0 0.0 5056 4288 pts/0 Ss 21:04 0:00 bash
mpiuser 43 0.5 0.0 259392 27648 pts/0 Sl+ 21:05 0:00 mpirun -np 2 .
mpiuser 48 0.0 0.0 98176 9984 pts/0 Sl 21:05 0:00 ./pmi_client
mpiuser 49 0.0 0.0 98176 9984 pts/0 Sl 21:05 0:00 ./pmi_client
mpiuser 52 0.3 0.0 4928 4160 pts/3 Ss 21:05 0:00 bash
mpiuser 71 0.0 0.0 10752 7616 pts/3 R+ 21:05 0:00 ps aux
[mpiuser@3cd5ac47d23f ~]$ gstack 48
Thread 2 (Thread 0x3fff8207f1b0 (LWP 50)):
#0 0x00003fff82b09178 in epoll_wait () from /lib64/libc.so.6
#1 0x00003fff8296b18c in epoll_dispatch (base=0x1001c3f4a10, tv=<optimized out>) at epoll.c:462
#2 0x00003fff8295c180 in event_base_loop (base=0x1001c3f4a10, flags=<optimized out>) at event.c:1947
#3 0x00003fff82e3cc0c in progress_engine () from /home/mpiuser/local/pmix/lib/libpmi.so.1
#4 0x00003fff82bd8b94 in start_thread () from /lib64/libpthread.so.0
#5 0x00003fff82b085f4 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x3fff82f57cd0 (LWP 48)):
#0 0x00003fff82bde7fc in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libpthread.so.0
#1 0x00003fff82df3e6c in PMIx_Get () from /home/mpiuser/local/pmix/lib/libpmi.so.1
#2 0x00003fff82e6d9d8 in PMI_KVS_Get () from /home/mpiuser/local/pmix/lib/libpmi.so.1
#3 0x0000000010002af8 in test_item6 () at pmi_client.c:384
#4 0x00000000100019bc in main (argc=1, argv=0x3ffffa01f7d8) at pmi_client.c:160
[mpiuser@3cd5ac47d23f ~]$ gstack 49
Thread 2 (Thread 0x3fff91a4f1b0 (LWP 51)):
#0 0x00003fff924d9178 in epoll_wait () from /lib64/libc.so.6
#1 0x00003fff9233b18c in epoll_dispatch (base=0x1000ccf4a10, tv=<optimized out>) at epoll.c:462
#2 0x00003fff9232c180 in event_base_loop (base=0x1000ccf4a10, flags=<optimized out>) at event.c:1947
#3 0x00003fff9280cc0c in progress_engine () from /home/mpiuser/local/pmix/lib/libpmi.so.1
#4 0x00003fff925a8b94 in start_thread () from /lib64/libpthread.so.0
#5 0x00003fff924d85f4 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x3fff92927cd0 (LWP 49)):
#0 0x00003fff925ae7fc in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libpthread.so.0
#1 0x00003fff927c3e6c in PMIx_Get () from /home/mpiuser/local/pmix/lib/libpmi.so.1
#2 0x00003fff9283d9d8 in PMI_KVS_Get () from /home/mpiuser/local/pmix/lib/libpmi.so.1
#3 0x0000000010002af8 in test_item6 () at pmi_client.c:384
#4 0x00000000100019bc in main (argc=1, argv=0x3fffcb9b0b68) at pmi_client.c:160
If I force the hash GDS component, then it works:
[mpiuser@3cd5ac47d23f test]$ export PMIX_MCA_gds=hash
[mpiuser@3cd5ac47d23f test]$ mpirun -np 2 ./pmi_client
0:INFO: spawned=0 size=2 rank=0 appnum=0
0:FATAL: 0: at test_item1:211
0:INFO: PMI_Get_id_length_max=255
0:INFO: jobid=2408185857
0:INFO: PMI_Get_kvs_domain_id=2408185857
0:INFO: PMI_KVS_Get_my_name=2408185857
0:INFO: TI1 : PASS
0:INFO: TI2 : PASS
0:INFO: PMI_KVS_Get_key_length_max=511
0:INFO: PMI_KVS_Get_value_length_max=4096
0:INFO: TI3 : PASS
0:INFO: PMI_Get_clique_size=2
0:INFO: TI4 : PASS
0:ERROR: PMIx and SLURM/PMI1 do not set 'PMI_process_mapping' (Do not mark test as failed)
0:INFO: TI5 : PASS
0:INFO: tkey=0:test_item6 tval=pmi_client.c val=pmi_client.c
0:INFO: TI6 : PASS
0:INFO: TEST7
0:INFO: BARRIER
1:INFO: spawned=0 size=2 rank=1 appnum=0
1:FATAL: 1: at test_item1:211
1:INFO: PMI_Get_id_length_max=255
1:INFO: jobid=2408185857
1:INFO: PMI_Get_kvs_domain_id=2408185857
1:INFO: PMI_KVS_Get_my_name=2408185857
1:INFO: TI1 : PASS
1:INFO: TI2 : PASS
1:INFO: PMI_KVS_Get_key_length_max=511
1:INFO: PMI_KVS_Get_value_length_max=4096
1:INFO: TI3 : PASS
1:INFO: PMI_Get_clique_size=2
1:INFO: TI4 : PASS
1:ERROR: PMIx and SLURM/PMI1 do not set 'PMI_process_mapping' (Do not mark test as failed)
1:INFO: TI5 : PASS
1:INFO: tkey=1:test_item6 tval=pmi_client.c val=pmi_client.c
1:INFO: TI6 : PASS
1:INFO: TEST7
1:INFO: BARRIER
0:INFO: Get key 0:KEY-0
0:INFO: tkey=0:KEY-0 tval=VALUE-0 val=VALUE-0
0:INFO: Get key 0:KEY-1
0:INFO: tkey=0:KEY-1 tval=VALUE-1 val=VALUE-1
0:INFO: Get key 1:KEY-0
1:INFO: Get key 0:KEY-0
1:INFO: tkey=0:KEY-0 tval=VALUE-0 val=VALUE-0
1:INFO: Get key 0:KEY-1
1:INFO: tkey=0:KEY-1 tval=VALUE-1 val=VALUE-1
1:INFO: Get key 1:KEY-0
1:INFO: tkey=1:KEY-0 tval=VALUE-0 val=VALUE-0
1:INFO: Get key 1:KEY-1
0:INFO: tkey=1:KEY-0 tval=VALUE-0 val=VALUE-0
0:INFO: Get key 1:KEY-1
1:INFO: tkey=1:KEY-1 tval=VALUE-1 val=VALUE-1
1:INFO: TI7 : PASS
0:INFO: tkey=1:KEY-1 tval=VALUE-1 val=VALUE-1
0:INFO: TI7 : PASS
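For anyone who wants to try the workaround without exporting the variable into their whole shell session, a small wrapper along these lines might help. This is only a sketch: `run_with_hash_gds` is a hypothetical helper name, and it assumes GNU coreutils `timeout` is available so that a hang shows up as exit status 124 instead of blocking the terminal.

```shell
# Hypothetical helper (not part of PMIx or Open MPI): run a command with the
# hash GDS component forced for that one invocation, under a time limit.
# Usage: run_with_hash_gds <seconds> <command> [args...]
# e.g.   run_with_hash_gds 60 mpirun -np 2 ./pmi_client
run_with_hash_gds() {
    rwhg_limit="$1"
    shift
    # env scopes PMIX_MCA_gds to this command only; timeout (GNU coreutils)
    # exits with status 124 if the command is still running at the limit.
    env PMIX_MCA_gds=hash timeout "$rwhg_limit" "$@"
}
```

Running the same command under plain `timeout` without the variable set gives a quick way to compare the two behaviors.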
Per discussion on the teleconf, this is likely because the dstore does not have an exhaustive search path for the PMI1 case, where the process identifier is not provided in the PMI_Get_* operations.
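To make the suspected cause a bit more concrete, here is a toy model in C. It is not PMIx code and does not reflect the real dstore or hash implementations; every name in it (`store_put`, `get_strict`, `get_exhaustive`, `RANK_UNKNOWN`) is made up for illustration. It only shows the difference between a lookup that requires the caller to name the owning rank and one that falls back to scanning every rank when the rank is unknown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUM_RANKS    2      /* matches "mpirun -np 2" in the report */
#define MAX_KEYS     8
#define RANK_UNKNOWN (-1)   /* hypothetical marker: caller gave no rank */

struct kv { const char *key; const char *val; };
struct rank_table { struct kv items[MAX_KEYS]; int count; };

/* Toy store: one small table per rank. */
static struct rank_table store[NUM_RANKS];

/* Record key/val under the rank that published it. Returns 0 on success. */
int store_put(int rank, const char *key, const char *val) {
    assert(rank >= 0 && rank < NUM_RANKS);
    struct rank_table *t = &store[rank];
    if (t->count >= MAX_KEYS) return -1;
    t->items[t->count].key = key;
    t->items[t->count].val = val;
    t->count++;
    return 0;
}

/* Linear scan of a single rank's table. */
const char *lookup_in_rank(int rank, const char *key) {
    const struct rank_table *t = &store[rank];
    for (int i = 0; i < t->count; i++) {
        if (strcmp(t->items[i].key, key) == 0) return t->items[i].val;
    }
    return NULL;
}

/* Strict lookup: finds nothing unless the caller names a valid rank. */
const char *get_strict(int rank, const char *key) {
    if (rank < 0 || rank >= NUM_RANKS) return NULL;
    return lookup_in_rank(rank, key);
}

/* Exhaustive lookup: with no rank given, search every rank's table. */
const char *get_exhaustive(int rank, const char *key) {
    if (rank != RANK_UNKNOWN) return get_strict(rank, key);
    for (int r = 0; r < NUM_RANKS; r++) {
        const char *v = lookup_in_rank(r, key);
        if (v != NULL) return v;
    }
    return NULL;
}
```

In this model, after `store_put(1, "1:KEY-0", "VALUE-0")`, the call `get_strict(RANK_UNKNOWN, "1:KEY-0")` returns `NULL` while `get_exhaustive(RANK_UNKNOWN, "1:KEY-0")` returns `"VALUE-0"`. The toy returns `NULL` on a miss; in the traces above the client instead stays blocked in `PMIx_Get`, so the analogy covers only the lookup strategy, not the waiting behavior.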
I wanted to file this so that it is searchable for folks that might hit this type of issue when using PMI1.
This should/will be shifted to the new PMI-1/2 repo: https://github.com/openpmix/pmi-shim
@rhc54 : any objection to using Github's "Transfer issue" to move the issue to the new repo? That will preserve the issue discussion/history thus far.
Didn't know about it
Thanks!