pmodels/mpich

CUDA-aware MPICH segfaults with `MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1`

I'm running a CUDA-aware MPICH + libfabric build on a two-node configuration (one NVIDIA GPU per node). Here's the output of `mpichversion`:

❯ mpichversion
MPICH Version:      4.2.2
MPICH Release date: Wed Jul  3 09:16:22 AM CDT 2024
MPICH ABI:          16:2:4
MPICH Device:       ch4:ofi
MPICH configure:    --prefix=/gpfs/home/acad/ucl-tfl/poncelet/soft/lib-MPICH-4.2.2-OFI-1.22.0-CUDA-12.2.0-opt --enable-fast=O3,ndebug,alwaysinline --with-cuda=/gpfs/softs/easybuild/2023a/software/CUDA/12.2.0 --with-device=ch4:ofi --with-libfabric=/gpfs/home/acad/ucl-tfl/poncelet/soft/lib-MPICH-4.2.2-OFI-1.22.0-CUDA-12.2.0-opt
MPICH CC:           gcc    -DNDEBUG -DNVALGRIND -O3
MPICH CXX:          g++   -DNDEBUG -DNVALGRIND -O3
MPICH F77:          gfortran   -O3
MPICH FC:           gfortran   -O3
MPICH features:     threadcomm

For context, I am running a performance comparison of MPICH over libfabric (using the OSU microbenchmarks) against libfabric alone (using fabtests). For both tests, the following environment variables are set:

# OFI env vars
export FI_PROVIDER="verbs,ofi_rxm,shm"
export FI_HMEM_CUDA_USE_GDRCOPY=1

export FI_OFI_RXM_BUFFER_SIZE=2048
export FI_OFI_RXM_SAR_LIMIT=2048

# MPICH env vars
export MPIR_CVAR_NOLOCAL=1
export MPIR_CVAR_ENABLE_GPU=1
export MPIR_CVAR_DEBUG_SUMMARY=1
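
As a quick runtime sanity check that this build actually reports CUDA awareness (independent of the CVARs above), something along the following lines can be used. This is a minimal sketch and assumes the `MPIX_GPU_query_support` extension is available in this MPICH 4.2.2 install:

```c
/* Minimal probe for CUDA awareness. Assumes the MPIX_GPU_query_support
 * extension (an MPICH-specific MPIX call) is available in this build. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int is_supported = 0;
    MPI_Init(&argc, &argv);
    MPIX_GPU_query_support(MPIX_GPU_SUPPORT_CUDA, &is_supported);
    printf("MPICH reports CUDA support: %d\n", is_supported);
    MPI_Finalize();
    return 0;
}
```

Compile it with the `mpicc` from the same install and run it with a single rank.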

Here is the result of running `fi_bw -p "verbs;ofi_rxm" -D cuda`:

bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
64      64      4k          0.00s     45.01       1.42       0.70
256     64      16k         0.00s     93.09       2.75       0.36
1k      64      64k         0.00s     17.95      57.06       0.02
4k      64      256k        0.00s   3692.17       1.11       0.90
64k     64      4m          0.00s  17476.27       3.75       0.27
1m      64      64m         0.00s  19907.70      52.67       0.02

And that of running `mpirun --bind-to core ${omb_dir}/pt2pt/osu_bw D D`:

# OSU MPI-CUDA Bandwidth Test v7.4
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
1                       0.07
2                       0.14
4                       0.29
8                       0.57
16                      1.15
32                      2.31
64                      4.65
128                     9.32
256                    18.51
512                    36.00
1024                   71.42
2048                  145.81
4096                  264.82
8192                  513.88
16384                 953.50
32768                1721.26
65536                2937.49
131072               1696.09
262144               2103.23
524288               2262.00
1048576              2258.56
2097152              2403.49
4194304              3391.54

which shows a large bandwidth gap between running with and without the MPICH layer.

Now, in an attempt to improve this, we set `MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1`, which led to the following segfault:

==== GPU Init (CUDA) ====
device_count: 1
CUDA_VISIBLE_DEVICES: 0
=========================
==== Various sizes and limits ====
sizeof(MPIDI_per_vci_t): 192
Required minimum FI_VERSION: 0, current version: 10016
==== GPU Init (CUDA) ====
device_count: 1
CUDA_VISIBLE_DEVICES: 0
=========================
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs, score = 0, pref = 0, FI_FORMAT_UNSPEC [32]
provider: verbs, score = 0, pref = 0, FI_FORMAT_UNSPEC [32]
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs, score = 0, pref = 0, FI_FORMAT_UNSPEC [32]
provider: verbs, score = 0, pref = 0, FI_FORMAT_UNSPEC [32]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 5, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs;ofi_rxm, score = 5, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 5, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 5, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs;ofi_rxm, score = 5, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 5, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxd, score = 0, pref = -2, FI_FORMAT_UNSPEC [32]
provider: verbs;ofi_rxd, score = 0, pref = -2, FI_FORMAT_UNSPEC [32]
provider: verbs;ofi_rxd, score = 0, pref = -2, FI_FORMAT_UNSPEC [32]
provider: verbs;ofi_rxd, score = 0, pref = -2, FI_FORMAT_UNSPEC [32]
provider: shm, score = 0, pref = -2, FI_ADDR_STR [17] - fi_shm://2755918
provider: shm, score = 5, pref = -2, FI_ADDR_STR [17] - fi_shm://2755918
Required minimum FI_VERSION: 10006, current version: 10016
==== Capability set configuration ====
libfabric provider: verbs;ofi_rxm - IB-0xfe80000000000000
MPIDI_OFI_ENABLE_DATA: 0
MPIDI_OFI_ENABLE_AV_TABLE: 1
MPIDI_OFI_ENABLE_SCALABLE_ENDPOINTS: 0
MPIDI_OFI_ENABLE_SHARED_CONTEXTS: 0
MPIDI_OFI_ENABLE_MR_VIRT_ADDRESS: 1
MPIDI_OFI_ENABLE_MR_ALLOCATED: 1
MPIDI_OFI_ENABLE_MR_REGISTER_NULL: 1
MPIDI_OFI_ENABLE_MR_PROV_KEY: 1
MPIDI_OFI_ENABLE_TAGGED: 1
MPIDI_OFI_ENABLE_AM: 1
MPIDI_OFI_ENABLE_RMA: 1
MPIDI_OFI_ENABLE_ATOMICS: 0
MPIDI_OFI_FETCH_ATOMIC_IOVECS: 1
MPIDI_OFI_ENABLE_DATA_AUTO_PROGRESS: 0
MPIDI_OFI_ENABLE_CONTROL_AUTO_PROGRESS: 0
MPIDI_OFI_ENABLE_PT2PT_NOPACK: 1
MPIDI_OFI_ENABLE_TRIGGERED: 0
MPIDI_OFI_ENABLE_HMEM: 1
MPIDI_OFI_NUM_AM_BUFFERS: 8
MPIDI_OFI_NUM_OPTIMIZED_MEMORY_REGIONS: 0
MPIDI_OFI_CONTEXT_BITS: 16
MPIDI_OFI_SOURCE_BITS: 23
MPIDI_OFI_TAG_BITS: 20
MPIDI_OFI_VNI_USE_DOMAIN: 1
MAXIMUM SUPPORTED RANKS: 8388608
MAXIMUM TAG: 1048576
==== Provider global thresholds ====
max_buffered_send: 192
max_buffered_write: 192
max_msg_size: 1073741824
max_order_raw: 1073741824
max_order_war: 0
max_order_waw: 1073741824
tx_iov_limit: 4
rx_iov_limit: 4
rma_iov_limit: 1
max_mr_key_size: 4
==== Various sizes and limits ====
MPIDI_OFI_AM_MSG_HEADER_SIZE: 24
MPIDI_OFI_MAX_AM_HDR_SIZE: 255
sizeof(MPIDI_OFI_am_request_header_t): 416
sizeof(MPIDI_OFI_per_vci_t): 52480
MPIDI_OFI_AM_HDR_POOL_CELL_SIZE: 1024
MPIDI_OFI_DEFAULT_SHORT_SEND_SIZE: 16384
==== OFI dynamic settings ====
num_vcis: 1
num_nics: 1
======================================
error checking    : enabled
QMPI              : disabled
debugger support  : disabled
thread level      : MPI_THREAD_SINGLE
thread CS         : per-vci
threadcomm        : enabled
==== data structure summary ====
sizeof(MPIR_Comm): 1792
sizeof(MPIR_Request): 512
sizeof(MPIR_Datatype): 280
================================

# OSU MPI-CUDA Bandwidth Test v7.4
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
1                       0.02
2                       0.02
4                       0.02
8                       0.03
16                      0.05
32                      0.09
64                      0.14
128                     0.26
256                     0.45

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 2755918 RUNNING AT cna019
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:1@cna021.lucia.cenaero.be] HYD_pmcd_pmip_control_cmd_cb (proxy/pmip_cb.c:484): assert (!closed) failed
[proxy:1@cna021.lucia.cenaero.be] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:1@cna021.lucia.cenaero.be] main (proxy/pmip.c:122): demux engine error waiting for event
srun: error: cna021: task 1: Exited with exit code 7
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

Update: we tried this with another ping-pong bandwidth test and saw no issue there, so it looks likely that the problem lies with the OSU benchmarks rather than with MPICH itself.
Results obtained for reference:

[0] Transfer size (B):          8, Transfer Time (s):     0.000014462, Bandwidth (GB/s):     0.000515191
[0] Transfer size (B):         16, Transfer Time (s):     0.000013622, Bandwidth (GB/s):     0.001093902
[0] Transfer size (B):         32, Transfer Time (s):     0.000014496, Bandwidth (GB/s):     0.002055962
[0] Transfer size (B):         64, Transfer Time (s):     0.000014030, Bandwidth (GB/s):     0.004248468
[0] Transfer size (B):        128, Transfer Time (s):     0.000015202, Bandwidth (GB/s):     0.007841798
[0] Transfer size (B):        256, Transfer Time (s):     0.000020035, Bandwidth (GB/s):     0.011900377
[0] Transfer size (B):        512, Transfer Time (s):     0.000037671, Bandwidth (GB/s):     0.012658082
[0] Transfer size (B):       1024, Transfer Time (s):     0.000093290, Bandwidth (GB/s):     0.010222732
[0] Transfer size (B):       2048, Transfer Time (s):     0.000148286, Bandwidth (GB/s):     0.012862617
[0] Transfer size (B):       4096, Transfer Time (s):     0.000017991, Bandwidth (GB/s):     0.212033525
[0] Transfer size (B):       8192, Transfer Time (s):     0.000017847, Bandwidth (GB/s):     0.427500166
[0] Transfer size (B):      16384, Transfer Time (s):     0.000018190, Bandwidth (GB/s):     0.838870671
[0] Transfer size (B):      32768, Transfer Time (s):     0.000018811, Bandwidth (GB/s):     1.622305501
[0] Transfer size (B):      65536, Transfer Time (s):     0.000020096, Bandwidth (GB/s):     3.037141569
[0] Transfer size (B):     131072, Transfer Time (s):     0.000023546, Bandwidth (GB/s):     5.184375161
[0] Transfer size (B):     262144, Transfer Time (s):     0.000028957, Bandwidth (GB/s):     8.431129034
[0] Transfer size (B):     524288, Transfer Time (s):     0.000039995, Bandwidth (GB/s):    12.208517637
[0] Transfer size (B):    1048576, Transfer Time (s):     0.000063189, Bandwidth (GB/s):    15.454633071
[0] Transfer size (B):    2097152, Transfer Time (s):     0.000108264, Bandwidth (GB/s):    18.040372009
[0] Transfer size (B):    4194304, Transfer Time (s):     0.000201065, Bandwidth (GB/s):    19.427837563
[0] Transfer size (B):    8388608, Transfer Time (s):     0.000387005, Bandwidth (GB/s):    20.187066211
[0] Transfer size (B):   16777216, Transfer Time (s):     0.000754742, Bandwidth (GB/s):    20.702448315
[0] Transfer size (B):   33554432, Transfer Time (s):     0.001491261, Bandwidth (GB/s):    20.955424526
[0] Transfer size (B):   67108864, Transfer Time (s):     0.003488139, Bandwidth (GB/s):    17.917864812
[0] Transfer size (B):  134217728, Transfer Time (s):     0.006919051, Bandwidth (GB/s):    18.066061737
[0] Transfer size (B):  268435456, Transfer Time (s):     0.013780443, Bandwidth (GB/s):    18.141652445
[0] Transfer size (B):  536870912, Transfer Time (s):     0.027505422, Bandwidth (GB/s):    18.178234240
[0] Transfer size (B): 1073741824, Transfer Time (s):     0.054889985, Bandwidth (GB/s):    18.218259620
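
For context, the ping-pong test above is a plain device-to-device bandwidth loop between two ranks. A minimal sketch of that kind of test is shown below; the sizes and iteration counts are illustrative and this is not the exact code used above. It assumes device buffers allocated with `cudaMalloc` are passed directly to `MPI_Send`/`MPI_Recv`:

```c
/* Minimal sketch of a CUDA device-to-device ping-pong bandwidth test.
 * Illustrative only: sizes and iteration counts are placeholders and this
 * is not the exact benchmark whose results are reported above. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 64;                 /* repetitions per message size */
    const size_t max_size = 1 << 26;      /* 64 MiB, illustrative upper bound */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf;
    cudaMalloc((void **)&buf, max_size);  /* device buffer handed directly to MPI */
    cudaMemset(buf, 0, max_size);

    for (size_t size = 8; size <= max_size; size *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, (int) size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int) size, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, (int) size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, (int) size, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
            }
        }
        /* round-trip time / 2 = one-way transfer time */
        double dt = (MPI_Wtime() - t0) / (2.0 * iters);
        if (rank == 0)
            printf("size %10zu B   time %.9f s   bw %.3f GB/s\n",
                   size, dt, size / dt / 1e9);
    }

    cudaFree(buf);
    MPI_Finalize();
    return 0;
}
```

Built with `mpicc`, linked against the CUDA runtime (`-lcudart`), and launched with one rank per node, this should exercise the same device-to-device path as the `osu_bw D D` run above.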