openpmix/pmi-shim

PMI1 shim hangs with dstore gds component, passes with hash

jjhursey opened this issue · 5 comments

Background information

What version of the PMIx Reference Library are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

Describe how PMIx was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

PMIx built with:

    ./configure --prefix=${PMIX_ROOT} \
                --with-hwloc=${HWLOC_INSTALL_PATH} \
                --with-libevent=${LIBEVENT_INSTALL_PATH} \
                --enable-debug

Open MPI built with:

    ./configure --prefix=${MPI_ROOT} \
                --with-hwloc=${HWLOC_INSTALL_PATH} \
                --with-libevent=${LIBEVENT_INSTALL_PATH} \
                --with-pmix=${PMIX_ROOT} \
                --enable-mpirun-prefix-by-default

Please describe the system on which you are running

  • Operating system/version: RHEL 7.6
  • Computer hardware: ppc64le
  • Network type: localhost

Details of the problem

I was testing the PMI1 shim with pmi_client.c from the v3.1.4 release.

[mpiuser@3cd5ac47d23f test]$ env | grep PMIX
PMIX_ROOT=/home/mpiuser/local/pmix
[mpiuser@3cd5ac47d23f test]$ mpicc pmi_client.c -I${PMIX_ROOT}/include -L${PMIX_ROOT}/lib -lpmi -Wall -g -O0 -o pmi_client
[mpiuser@3cd5ac47d23f test]$ mpirun -np 2 ./pmi_client
0:INFO: spawned=0 size=2 rank=0 appnum=0
0:FATAL: 0: at test_item1:211
0:INFO: PMI_Get_id_length_max=255
0:INFO: jobid=2412576769
1:INFO: spawned=0 size=2 rank=1 appnum=0
1:FATAL: 1: at test_item1:211
1:INFO: PMI_Get_id_length_max=255
1:INFO: jobid=2412576769
1:INFO: PMI_Get_kvs_domain_id=2412576769
1:INFO: PMI_KVS_Get_my_name=2412576769
1:INFO: TI1  : PASS
1:INFO: TI2  : PASS
1:INFO: PMI_KVS_Get_key_length_max=511
1:INFO: PMI_KVS_Get_value_length_max=4096
1:INFO: TI3  : PASS
1:INFO: PMI_Get_clique_size=2
0:INFO: PMI_Get_kvs_domain_id=2412576769
0:INFO: PMI_KVS_Get_my_name=2412576769
0:INFO: TI1  : PASS
0:INFO: TI2  : PASS
0:INFO: PMI_KVS_Get_key_length_max=511
0:INFO: PMI_KVS_Get_value_length_max=4096
0:INFO: TI3  : PASS
0:INFO: PMI_Get_clique_size=2
1:INFO: TI4  : PASS
1:ERROR: PMIx and SLURM/PMI1 do not set 'PMI_process_mapping' (Do not mark test as failed)
1:INFO: TI5  : PASS
0:INFO: TI4  : PASS
0:ERROR: PMIx and SLURM/PMI1 do not set 'PMI_process_mapping' (Do not mark test as failed)
0:INFO: TI5  : PASS
<<<---- Hangs here
[mpiuser@3cd5ac47d23f ~]$ ps aux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mpiuser       1  0.0  0.0   5056  4288 pts/0    Ss   21:04   0:00 bash
mpiuser      43  0.5  0.0 259392 27648 pts/0    Sl+  21:05   0:00 mpirun -np 2 .
mpiuser      48  0.0  0.0  98176  9984 pts/0    Sl   21:05   0:00 ./pmi_client
mpiuser      49  0.0  0.0  98176  9984 pts/0    Sl   21:05   0:00 ./pmi_client
mpiuser      52  0.3  0.0   4928  4160 pts/3    Ss   21:05   0:00 bash
mpiuser      71  0.0  0.0  10752  7616 pts/3    R+   21:05   0:00 ps aux
[mpiuser@3cd5ac47d23f ~]$ gstack 48
Thread 2 (Thread 0x3fff8207f1b0 (LWP 50)):
#0  0x00003fff82b09178 in epoll_wait () from /lib64/libc.so.6
#1  0x00003fff8296b18c in epoll_dispatch (base=0x1001c3f4a10, tv=<optimized out>) at epoll.c:462
#2  0x00003fff8295c180 in event_base_loop (base=0x1001c3f4a10, flags=<optimized out>) at event.c:1947
#3  0x00003fff82e3cc0c in progress_engine () from /home/mpiuser/local/pmix/lib/libpmi.so.1
#4  0x00003fff82bd8b94 in start_thread () from /lib64/libpthread.so.0
#5  0x00003fff82b085f4 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x3fff82f57cd0 (LWP 48)):
#0  0x00003fff82bde7fc in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libpthread.so.0
#1  0x00003fff82df3e6c in PMIx_Get () from /home/mpiuser/local/pmix/lib/libpmi.so.1
#2  0x00003fff82e6d9d8 in PMI_KVS_Get () from /home/mpiuser/local/pmix/lib/libpmi.so.1
#3  0x0000000010002af8 in test_item6 () at pmi_client.c:384
#4  0x00000000100019bc in main (argc=1, argv=0x3ffffa01f7d8) at pmi_client.c:160
[mpiuser@3cd5ac47d23f ~]$ gstack 49
Thread 2 (Thread 0x3fff91a4f1b0 (LWP 51)):
#0  0x00003fff924d9178 in epoll_wait () from /lib64/libc.so.6
#1  0x00003fff9233b18c in epoll_dispatch (base=0x1000ccf4a10, tv=<optimized out>) at epoll.c:462
#2  0x00003fff9232c180 in event_base_loop (base=0x1000ccf4a10, flags=<optimized out>) at event.c:1947
#3  0x00003fff9280cc0c in progress_engine () from /home/mpiuser/local/pmix/lib/libpmi.so.1
#4  0x00003fff925a8b94 in start_thread () from /lib64/libpthread.so.0
#5  0x00003fff924d85f4 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x3fff92927cd0 (LWP 49)):
#0  0x00003fff925ae7fc in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libpthread.so.0
#1  0x00003fff927c3e6c in PMIx_Get () from /home/mpiuser/local/pmix/lib/libpmi.so.1
#2  0x00003fff9283d9d8 in PMI_KVS_Get () from /home/mpiuser/local/pmix/lib/libpmi.so.1
#3  0x0000000010002af8 in test_item6 () at pmi_client.c:384
#4  0x00000000100019bc in main (argc=1, argv=0x3fffcb9b0b68) at pmi_client.c:160

If I force the hash GDS component, the test runs to completion:

[mpiuser@3cd5ac47d23f test]$ export PMIX_MCA_gds=hash
[mpiuser@3cd5ac47d23f test]$ mpirun -np 2 ./pmi_client
0:INFO: spawned=0 size=2 rank=0 appnum=0
0:FATAL: 0: at test_item1:211
0:INFO: PMI_Get_id_length_max=255
0:INFO: jobid=2408185857
0:INFO: PMI_Get_kvs_domain_id=2408185857
0:INFO: PMI_KVS_Get_my_name=2408185857
0:INFO: TI1  : PASS
0:INFO: TI2  : PASS
0:INFO: PMI_KVS_Get_key_length_max=511
0:INFO: PMI_KVS_Get_value_length_max=4096
0:INFO: TI3  : PASS
0:INFO: PMI_Get_clique_size=2
0:INFO: TI4  : PASS
0:ERROR: PMIx and SLURM/PMI1 do not set 'PMI_process_mapping' (Do not mark test as failed)
0:INFO: TI5  : PASS
0:INFO: tkey=0:test_item6 tval=pmi_client.c val=pmi_client.c
0:INFO: TI6  : PASS
0:INFO: TEST7
0:INFO: BARRIER
1:INFO: spawned=0 size=2 rank=1 appnum=0
1:FATAL: 1: at test_item1:211
1:INFO: PMI_Get_id_length_max=255
1:INFO: jobid=2408185857
1:INFO: PMI_Get_kvs_domain_id=2408185857
1:INFO: PMI_KVS_Get_my_name=2408185857
1:INFO: TI1  : PASS
1:INFO: TI2  : PASS
1:INFO: PMI_KVS_Get_key_length_max=511
1:INFO: PMI_KVS_Get_value_length_max=4096
1:INFO: TI3  : PASS
1:INFO: PMI_Get_clique_size=2
1:INFO: TI4  : PASS
1:ERROR: PMIx and SLURM/PMI1 do not set 'PMI_process_mapping' (Do not mark test as failed)
1:INFO: TI5  : PASS
1:INFO: tkey=1:test_item6 tval=pmi_client.c val=pmi_client.c
1:INFO: TI6  : PASS
1:INFO: TEST7
1:INFO: BARRIER
0:INFO: Get key 0:KEY-0
0:INFO: tkey=0:KEY-0 tval=VALUE-0 val=VALUE-0
0:INFO: Get key 0:KEY-1
0:INFO: tkey=0:KEY-1 tval=VALUE-1 val=VALUE-1
0:INFO: Get key 1:KEY-0
1:INFO: Get key 0:KEY-0
1:INFO: tkey=0:KEY-0 tval=VALUE-0 val=VALUE-0
1:INFO: Get key 0:KEY-1
1:INFO: tkey=0:KEY-1 tval=VALUE-1 val=VALUE-1
1:INFO: Get key 1:KEY-0
1:INFO: tkey=1:KEY-0 tval=VALUE-0 val=VALUE-0
1:INFO: Get key 1:KEY-1
0:INFO: tkey=1:KEY-0 tval=VALUE-0 val=VALUE-0
0:INFO: Get key 1:KEY-1
1:INFO: tkey=1:KEY-1 tval=VALUE-1 val=VALUE-1
1:INFO: TI7  : PASS
0:INFO: tkey=1:KEY-1 tval=VALUE-1 val=VALUE-1
0:INFO: TI7  : PASS

Per discussion on the teleconference, this is likely because the dstore does not have an exhaustive search path for the PMI1 case, where the process identifier is not provided in the PMI_Get_* operations.
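To make the suspected cause concrete, here is a toy model in C of a store that files each key under the rank that published it. This is only an illustrative sketch, not the dstore or hash component code: every name in it (`job_store_t`, `store_put`, `store_get_strict`, `store_get_exhaustive`, `RANK_UNKNOWN`) is hypothetical. It shows why a lookup that requires a rank misses when the caller cannot name one, while a lookup with an exhaustive fallback still finds the value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_RANKS    4
#define MAX_KEYS     8
#define RANK_UNKNOWN (-1)  /* hypothetical sentinel: caller did not name a rank */

typedef struct {
    const char *key;
    const char *val;
} kv_t;

/* One slot per rank, holding the keys that rank published. */
typedef struct {
    kv_t   items[MAX_KEYS];
    size_t count;
} rank_store_t;

typedef struct {
    rank_store_t ranks[MAX_RANKS];
} job_store_t;

/* File key/val under the publishing rank. Returns 0 on success, -1 on error. */
int store_put(job_store_t *js, int rank, const char *key, const char *val)
{
    assert(js != NULL && key != NULL && val != NULL);
    if (rank < 0 || rank >= MAX_RANKS) return -1;
    rank_store_t *rs = &js->ranks[rank];
    if (rs->count >= MAX_KEYS) return -1;
    rs->items[rs->count].key = key;
    rs->items[rs->count].val = val;
    rs->count++;
    return 0;
}

/* Search exactly one rank's slot; rank must be a valid index. */
const char *lookup_one(const job_store_t *js, int rank, const char *key)
{
    const rank_store_t *rs = &js->ranks[rank];
    for (size_t i = 0; i < rs->count; i++) {
        if (strcmp(rs->items[i].key, key) == 0) return rs->items[i].val;
    }
    return NULL;
}

/* Strict lookup: can only succeed when the caller names the owning rank. */
const char *store_get_strict(const job_store_t *js, int rank, const char *key)
{
    assert(js != NULL && key != NULL);
    if (rank < 0 || rank >= MAX_RANKS) return NULL;
    return lookup_one(js, rank, key);
}

/* Exhaustive lookup: if no rank is named, scan every rank's slot in order. */
const char *store_get_exhaustive(const job_store_t *js, int rank, const char *key)
{
    assert(js != NULL && key != NULL);
    if (rank != RANK_UNKNOWN) return store_get_strict(js, rank, key);
    for (int r = 0; r < MAX_RANKS; r++) {
        const char *v = lookup_one(js, r, key);
        if (v != NULL) return v;
    }
    return NULL;
}
```

In this model, after `store_put(&js, 1, "1:test_item6", "pmi_client.c")`, a call to `store_get_strict(&js, RANK_UNKNOWN, "1:test_item6")` returns `NULL`, while `store_get_exhaustive` with the same arguments returns the stored value. The toy simply reports a miss; in the real library a miss may instead turn into waiting for the data to arrive, which would be consistent with both ranks sitting in `pthread_cond_wait` under `PMIx_Get` in the stacks above.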

I wanted to file this so that it is searchable for folks that might hit this type of issue when using PMI1.

rhc54 commented

This should/will be shifted to the new PMI-1/2 repo: https://github.com/openpmix/pmi-shim

@rhc54: any objection to using GitHub's "Transfer issue" to move the issue to the new repo? That will preserve the issue discussion/history thus far.

rhc54 commented

Didn't know about it 😄 Sure, I can do that.

Thanks!