CUDA-aware MPICH segfaults with `MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1`
I'm running a CUDA-aware MPICH + libfabric build on a two-node configuration (one Nvidia GPU per node). Here's the output of `mpichversion`:
❯ mpichversion
MPICH Version: 4.2.2
MPICH Release date: Wed Jul 3 09:16:22 AM CDT 2024
MPICH ABI: 16:2:4
MPICH Device: ch4:ofi
MPICH configure: --prefix=/gpfs/home/acad/ucl-tfl/poncelet/soft/lib-MPICH-4.2.2-OFI-1.22.0-CUDA-12.2.0-opt --enable-fast=O3,ndebug,alwaysinline --with-cuda=/gpfs/softs/easybuild/2023a/software/CUDA/12.2.0 --with-device=ch4:ofi --with-libfabric=/gpfs/home/acad/ucl-tfl/poncelet/soft/lib-MPICH-4.2.2-OFI-1.22.0-CUDA-12.2.0-opt
MPICH CC: gcc -DNDEBUG -DNVALGRIND -O3
MPICH CXX: g++ -DNDEBUG -DNVALGRIND -O3
MPICH F77: gfortran -O3
MPICH FC: gfortran -O3
MPICH features: threadcomm
For context, I am running a performance comparison of MPICH + libfabric (using the OSU microbenchmarks) vs libfabric alone (using fabtests). For both tests, the following environment variables are set:
# OFI env vars
export FI_PROVIDER="verbs,ofi_rxm,shm"
export FI_HMEM_CUDA_USE_GDRCOPY=1
export FI_OFI_RXM_BUFFER_SIZE=2048
export FI_OFI_RXM_SAR_LIMIT=2048
# MPICH env vars
export MPIR_CVAR_NOLOCAL=1
export MPIR_CVAR_ENABLE_GPU=1
export MPIR_CVAR_DEBUG_SUMMARY=1
Here is the result of running `fi_bw -p "verbs;ofi_rxm" -D cuda`:
bytes iters total time MB/sec usec/xfer Mxfers/sec
64 64 4k 0.00s 45.01 1.42 0.70
256 64 16k 0.00s 93.09 2.75 0.36
1k 64 64k 0.00s 17.95 57.06 0.02
4k 64 256k 0.00s 3692.17 1.11 0.90
64k 64 4m 0.00s 17476.27 3.75 0.27
1m 64 64m 0.00s 19907.70 52.67 0.02
And that of running `mpirun --bind-to core ${omb_dir}/pt2pt/osu_bw D D`:
# OSU MPI-CUDA Bandwidth Test v7.4
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s)
1 0.07
2 0.14
4 0.29
8 0.57
16 1.15
32 2.31
64 4.65
128 9.32
256 18.51
512 36.00
1024 71.42
2048 145.81
4096 264.82
8192 513.88
16384 953.50
32768 1721.26
65536 2937.49
131072 1696.09
262144 2103.23
524288 2262.00
1048576 2258.56
2097152 2403.49
4194304 3391.54
which shows a pretty big bandwidth difference with and without the MPICH layer.
Now, in an attempt to improve this, we set `MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1`, which led to the following segfault:
==== GPU Init (CUDA) ====
device_count: 1
CUDA_VISIBLE_DEVICES: 0
=========================
==== Various sizes and limits ====
sizeof(MPIDI_per_vci_t): 192
Required minimum FI_VERSION: 0, current version: 10016
==== GPU Init (CUDA) ====
device_count: 1
CUDA_VISIBLE_DEVICES: 0
=========================
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs, score = 0, pref = 0, FI_FORMAT_UNSPEC [32]
provider: verbs, score = 0, pref = 0, FI_FORMAT_UNSPEC [32]
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs, score = 0, pref = 0, FI_FORMAT_UNSPEC [32]
provider: verbs, score = 0, pref = 0, FI_FORMAT_UNSPEC [32]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 5, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs;ofi_rxm, score = 5, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 5, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 5, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs;ofi_rxm, score = 5, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 5, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.93.4.219
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxm, score = 0, pref = 0, FI_SOCKADDR_IB [48]
provider: verbs;ofi_rxd, score = 0, pref = -2, FI_FORMAT_UNSPEC [32]
provider: verbs;ofi_rxd, score = 0, pref = -2, FI_FORMAT_UNSPEC [32]
provider: verbs;ofi_rxd, score = 0, pref = -2, FI_FORMAT_UNSPEC [32]
provider: verbs;ofi_rxd, score = 0, pref = -2, FI_FORMAT_UNSPEC [32]
provider: shm, score = 0, pref = -2, FI_ADDR_STR [17] - fi_shm://2755918
provider: shm, score = 5, pref = -2, FI_ADDR_STR [17] - fi_shm://2755918
Required minimum FI_VERSION: 10006, current version: 10016
==== Capability set configuration ====
libfabric provider: verbs;ofi_rxm - IB-0xfe80000000000000
MPIDI_OFI_ENABLE_DATA: 0
MPIDI_OFI_ENABLE_AV_TABLE: 1
MPIDI_OFI_ENABLE_SCALABLE_ENDPOINTS: 0
MPIDI_OFI_ENABLE_SHARED_CONTEXTS: 0
MPIDI_OFI_ENABLE_MR_VIRT_ADDRESS: 1
MPIDI_OFI_ENABLE_MR_ALLOCATED: 1
MPIDI_OFI_ENABLE_MR_REGISTER_NULL: 1
MPIDI_OFI_ENABLE_MR_PROV_KEY: 1
MPIDI_OFI_ENABLE_TAGGED: 1
MPIDI_OFI_ENABLE_AM: 1
MPIDI_OFI_ENABLE_RMA: 1
MPIDI_OFI_ENABLE_ATOMICS: 0
MPIDI_OFI_FETCH_ATOMIC_IOVECS: 1
MPIDI_OFI_ENABLE_DATA_AUTO_PROGRESS: 0
MPIDI_OFI_ENABLE_CONTROL_AUTO_PROGRESS: 0
MPIDI_OFI_ENABLE_PT2PT_NOPACK: 1
MPIDI_OFI_ENABLE_TRIGGERED: 0
MPIDI_OFI_ENABLE_HMEM: 1
MPIDI_OFI_NUM_AM_BUFFERS: 8
MPIDI_OFI_NUM_OPTIMIZED_MEMORY_REGIONS: 0
MPIDI_OFI_CONTEXT_BITS: 16
MPIDI_OFI_SOURCE_BITS: 23
MPIDI_OFI_TAG_BITS: 20
MPIDI_OFI_VNI_USE_DOMAIN: 1
MAXIMUM SUPPORTED RANKS: 8388608
MAXIMUM TAG: 1048576
==== Provider global thresholds ====
max_buffered_send: 192
max_buffered_write: 192
max_msg_size: 1073741824
max_order_raw: 1073741824
max_order_war: 0
max_order_waw: 1073741824
tx_iov_limit: 4
rx_iov_limit: 4
rma_iov_limit: 1
max_mr_key_size: 4
==== Various sizes and limits ====
MPIDI_OFI_AM_MSG_HEADER_SIZE: 24
MPIDI_OFI_MAX_AM_HDR_SIZE: 255
sizeof(MPIDI_OFI_am_request_header_t): 416
sizeof(MPIDI_OFI_per_vci_t): 52480
MPIDI_OFI_AM_HDR_POOL_CELL_SIZE: 1024
MPIDI_OFI_DEFAULT_SHORT_SEND_SIZE: 16384
==== OFI dynamic settings ====
num_vcis: 1
num_nics: 1
======================================
error checking : enabled
QMPI : disabled
debugger support : disabled
thread level : MPI_THREAD_SINGLE
thread CS : per-vci
threadcomm : enabled
==== data structure summary ====
sizeof(MPIR_Comm): 1792
sizeof(MPIR_Request): 512
sizeof(MPIR_Datatype): 280
================================
# OSU MPI-CUDA Bandwidth Test v7.4
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s)
1 0.02
2 0.02
4 0.02
8 0.03
16 0.05
32 0.09
64 0.14
128 0.26
256 0.45
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 2755918 RUNNING AT cna019
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:1@cna021.lucia.cenaero.be] HYD_pmcd_pmip_control_cmd_cb (proxy/pmip_cb.c:484): assert (!closed) failed
[proxy:1@cna021.lucia.cenaero.be] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:1@cna021.lucia.cenaero.be] main (proxy/pmip.c:122): demux engine error waiting for event
srun: error: cna021: task 1: Exited with exit code 7
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Update: we tried this with another ping-pong bandwidth test and there's no issue... so it looks likely that the issue actually lies with the OSU benchmarks rather than with MPICH itself.
Results obtained for reference:
[0] Transfer size (B): 8, Transfer Time (s): 0.000014462, Bandwidth (GB/s): 0.000515191
[0] Transfer size (B): 16, Transfer Time (s): 0.000013622, Bandwidth (GB/s): 0.001093902
[0] Transfer size (B): 32, Transfer Time (s): 0.000014496, Bandwidth (GB/s): 0.002055962
[0] Transfer size (B): 64, Transfer Time (s): 0.000014030, Bandwidth (GB/s): 0.004248468
[0] Transfer size (B): 128, Transfer Time (s): 0.000015202, Bandwidth (GB/s): 0.007841798
[0] Transfer size (B): 256, Transfer Time (s): 0.000020035, Bandwidth (GB/s): 0.011900377
[0] Transfer size (B): 512, Transfer Time (s): 0.000037671, Bandwidth (GB/s): 0.012658082
[0] Transfer size (B): 1024, Transfer Time (s): 0.000093290, Bandwidth (GB/s): 0.010222732
[0] Transfer size (B): 2048, Transfer Time (s): 0.000148286, Bandwidth (GB/s): 0.012862617
[0] Transfer size (B): 4096, Transfer Time (s): 0.000017991, Bandwidth (GB/s): 0.212033525
[0] Transfer size (B): 8192, Transfer Time (s): 0.000017847, Bandwidth (GB/s): 0.427500166
[0] Transfer size (B): 16384, Transfer Time (s): 0.000018190, Bandwidth (GB/s): 0.838870671
[0] Transfer size (B): 32768, Transfer Time (s): 0.000018811, Bandwidth (GB/s): 1.622305501
[0] Transfer size (B): 65536, Transfer Time (s): 0.000020096, Bandwidth (GB/s): 3.037141569
[0] Transfer size (B): 131072, Transfer Time (s): 0.000023546, Bandwidth (GB/s): 5.184375161
[0] Transfer size (B): 262144, Transfer Time (s): 0.000028957, Bandwidth (GB/s): 8.431129034
[0] Transfer size (B): 524288, Transfer Time (s): 0.000039995, Bandwidth (GB/s): 12.208517637
[0] Transfer size (B): 1048576, Transfer Time (s): 0.000063189, Bandwidth (GB/s): 15.454633071
[0] Transfer size (B): 2097152, Transfer Time (s): 0.000108264, Bandwidth (GB/s): 18.040372009
[0] Transfer size (B): 4194304, Transfer Time (s): 0.000201065, Bandwidth (GB/s): 19.427837563
[0] Transfer size (B): 8388608, Transfer Time (s): 0.000387005, Bandwidth (GB/s): 20.187066211
[0] Transfer size (B): 16777216, Transfer Time (s): 0.000754742, Bandwidth (GB/s): 20.702448315
[0] Transfer size (B): 33554432, Transfer Time (s): 0.001491261, Bandwidth (GB/s): 20.955424526
[0] Transfer size (B): 67108864, Transfer Time (s): 0.003488139, Bandwidth (GB/s): 17.917864812
[0] Transfer size (B): 134217728, Transfer Time (s): 0.006919051, Bandwidth (GB/s): 18.066061737
[0] Transfer size (B): 268435456, Transfer Time (s): 0.013780443, Bandwidth (GB/s): 18.141652445
[0] Transfer size (B): 536870912, Transfer Time (s): 0.027505422, Bandwidth (GB/s): 18.178234240
[0] Transfer size (B): 1073741824, Transfer Time (s): 0.054889985, Bandwidth (GB/s): 18.218259620
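For reference, here is a minimal sketch of this kind of device-to-device ping-pong bandwidth test, assuming a CUDA-aware MPI build; the message sizes, iteration count, and output format below are illustrative and not the exact test that produced the numbers above:

```c
/* Minimal device-to-device ping-pong bandwidth sketch (2 ranks).
 * Assumes a CUDA-aware MPI: buffers passed to MPI_Send/MPI_Recv
 * live in GPU memory. Build with mpicc and link the CUDA runtime
 * (-lcudart). Sizes and iteration count are illustrative only. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t max_size = 1 << 26;  /* 64 MiB upper bound, illustrative */
    const int iters = 64;

    char *buf;
    cudaMalloc((void **)&buf, max_size);
    cudaMemset(buf, 0, max_size);

    for (size_t size = 8; size <= max_size; size *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)size, MPI_CHAR, 1, 1, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)size, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
            }
        }

        if (rank == 0) {
            /* one-way transfer time: total elapsed / (2 * iters) */
            double xfer = (MPI_Wtime() - t0) / (2.0 * iters);
            printf("[0] Transfer size (B): %zu, Transfer Time (s): %.9f, "
                   "Bandwidth (GB/s): %.9f\n",
                   size, xfer, (double)size / xfer / 1e9);
        }
    }

    cudaFree(buf);
    MPI_Finalize();
    return 0;
}
```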