openucx/ucx

rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7f9030c18000/8000

jinz2014 opened this issue · 14 comments

Describe the issue

[1724610589.249079] [cousteau:2779987:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7f9030c18000/8000
[1724610589.249092] [cousteau:2779986:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7fd7af610000/8000
[cousteau:2779987:0:2779987] rndv.c:1872 Assertion `sreq->send.rndv.lanes_count > 0' failed
[cousteau:2779986:0:2779986] rndv.c:1872 Assertion `sreq->send.rndv.lanes_count > 0' failed

Steps to Reproduce

export INSTALL_DIR=$HOME/ompi_for_gpu
export BUILD_DIR=/tmp/ompi_for_gpu_build
mkdir -p $BUILD_DIR

export UCX_DIR=$INSTALL_DIR/ucx
cd $BUILD_DIR
git clone https://github.com/openucx/ucx.git -b v1.15.x
cd ucx
./autogen.sh
mkdir build
cd build
../configure --prefix=$UCX_DIR \
    --with-rocm=/opt/rocm
make -j $(nproc)
make -j $(nproc) install

export OMPI_DIR=$INSTALL_DIR/ompi
cd $BUILD_DIR
git clone --recursive https://github.com/open-mpi/ompi.git \
    -b v5.0.x
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR \
    --with-rocm=/opt/rocm
make -j $(nproc)
make install

export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:/opt/rocm/lib
export PATH=$OMPI_DIR/bin:$PATH

The example is in https://github.com/zjin-lcf/HeCBench/tree/master/src/allreduce-hip

make run

Setup and versions

  • GPU: AMD MI100
  • ROCm: 6.0.2

@jinz2014 this is most likely a system setup or permission issue on your side, since UCX 1.15 has been used extensively with numerous applications on MI100.

Can you please check the following things:

The answers are yes to both questions.
I didn't paste the complete output. The program starts producing error messages after an initially successful run:

Verified allreduce for size 0 (19.865 us per iteration)
Verified allreduce for size 32 (52.7884 us per iteration)
Verified allreduce for size 256 (94.3108 us per iteration)
Verified allreduce for size 1024 (73.2143 us per iteration)
Verified allreduce for size 4096 (88.3691 us per iteration)
[1724605863.595828] [cousteau:2757379:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7f8d0fa10000/8000
[1724605863.595828] [cousteau:2757380:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7effaba18000/8000
[cousteau:2757380:0:2757380] rndv.c:1872 Assertion `sreq->send.rndv.lanes_count > 0' failed
[cousteau:2757379:0:2757379] rndv.c:1872 Assertion `sreq->send.rndv.lanes_count > 0' failed
==== backtrace (tid:2757380) ====
0 /home/user/ompi_for_gpu/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f01d67bbd84]
1 /home/user/ompi_for_gpu/ucx/lib/libucs.so.0(ucs_fatal_error_message+0xc2) [0x7f01d67b8dc2]
2 /home/user/ompi_for_gpu/ucx/lib/libucs.so.0(ucs_fatal_error_format+0x11a) [0x7f01d67b8eea]
3 /home/user/ompi_for_gpu/ucx/lib/libucp.so.0(ucp_rndv_progress_rma_put_zcopy+0x1b8) [0x7f01d68a8a08]
4 /home/user/ompi_for_gpu/ucx/lib/libucp.so.0(ucp_rndv_atp_handler+0x217) [0x7f01d68a9ac7]
5 /home/user/ompi_for_gpu/ucx/lib/libuct.so.0(+0x1c6ad) [0x7f01cd7916ad]
6 /home/user/ompi_for_gpu/ucx/lib/libucp.so.0(ucp_worker_progress+0x3a) [0x7f01d6859e3a]
7 /home/user/ompi_for_gpu/ompi/lib/libmpi.so.40(mca_pml_ucx_send+0x1bf) [0x7f01d8bd21df]
8 /home/user/ompi_for_gpu/ompi/lib/libmpi.so.40(MPI_Send+0x183) [0x7f01d8a59b63]
9 ./main() [0x206a9f]
10 ./main() [0x205b6a]
11 ./main() [0x2060cd]
12 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f01d698ed90]
13 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f01d698ee40]
14 ./main() [0x205835]

Could you please provide the full command line that you used? I see that the put_zcopy protocol is being used, which is not the default with 1.15; the default should be the get_zcopy protocol.

Sorry, I am not familiar with the two protocols.

"make run" shows the full command:

$HOME/ompi_for_gpu/ompi/bin/mpirun -n 2 ./main

Thank you for the instructions.

So just for a test, could you change the command line to the following:

$HOME/ompi_for_gpu/ompi/bin/mpirun -x UCX_RNDV_SCHEME=get_zcopy -n 2 ./main

to see whether it makes a difference?

Ok.

$HOME/ompi_for_gpu/ompi/bin/mpirun -x UCX_RNDV_SCHEME=get_zcopy -n 2 ./main
[1724695907.879444] [cousteau:3183448:0] parser.c:2036 UCX WARN unused environment variable: UCX_DIR (maybe: UCX_TLS?)
[1724695907.879444] [cousteau:3183448:0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1724695907.897366] [cousteau:3183447:0] parser.c:2036 UCX WARN unused environment variable: UCX_DIR (maybe: UCX_TLS?)
[1724695907.897366] [cousteau:3183447:0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
Verified allreduce for size 0 (20.0202 us per iteration)
Verified allreduce for size 32 (52.4041 us per iteration)
Verified allreduce for size 256 (91.5858 us per iteration)
Verified allreduce for size 1024 (67.5217 us per iteration)
Verified allreduce for size 4096 (79.6616 us per iteration)
[1724695938.071148] [cousteau:3183447:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7fa55f810000/8000
[1724695938.071145] [cousteau:3183448:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7faefc618000/8000
[1724695938.071299] [cousteau:3183448:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7faefc620000/8000
[1724695938.071304] [cousteau:3183447:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7fa55f820000/8000
[1724695938.071585] [cousteau:3183447:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7fa55f818000/8000
[1724695938.071597] [cousteau:3183447:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7fa55f810000/8000

Hm. OK, I will see whether I can reproduce the issue locally. Are there instructions in the GitHub repo on how to compile the test code?

export INSTALL_DIR=$HOME/ompi_for_gpu
export BUILD_DIR=/tmp/ompi_for_gpu_build
mkdir -p $BUILD_DIR

export UCX_DIR=$INSTALL_DIR/ucx
cd $BUILD_DIR
git clone https://github.com/openucx/ucx.git -b v1.15.x
cd ucx
./autogen.sh
mkdir build
cd build
../configure --prefix=$UCX_DIR \
    --with-rocm=/opt/rocm
make -j $(nproc)
make -j $(nproc) install

export OMPI_DIR=$INSTALL_DIR/ompi
cd $BUILD_DIR
git clone --recursive https://github.com/open-mpi/ompi.git \
    -b v5.0.x
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR \
    --with-rocm=/opt/rocm
make -j $(nproc)
make install

export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:/opt/rocm/lib
export PATH=$OMPI_DIR/bin:$PATH

The example is in https://github.com/zjin-lcf/HeCBench/tree/master/src/allreduce-hip

make run

will build and run the program.

The HIP example was migrated from the CUDA example. I didn't observe errors when running the CUDA code, so it is not clear to me where the issue in the HIP example lies.
https://github.com/zjin-lcf/HeCBench/tree/master/src/allreduce-cuda

Thanks

OK, but just to clarify: compiling the example is simply make run? (I compile UCX and Open MPI on a daily basis, so that is not the challenge :-) )

make run
hipcc -std=c++14 -I$HOME/ompi_for_gpu/ompi/include -DOMPI_SKIP_MPICXX= -Wall -O3 -c main.cu -o main.o
hipcc -std=c++14 -I$HOME/ompi_for_gpu/ompi/include -DOMPI_SKIP_MPICXX= -Wall -O3 -c collectives.cu -o collectives.o
hipcc -std=c++14 -I$HOME/ompi_for_gpu/ompi/include -DOMPI_SKIP_MPICXX= -Wall -O3 -c timer.cu -o timer.o
hipcc -std=c++14 -I$HOME/ompi_for_gpu/ompi/include -DOMPI_SKIP_MPICXX= -Wall -O3 main.o collectives.o timer.o -o main -L$HOME/ompi_for_gpu/ompi/lib -lmpi -DOMPI_SKIP_MPICXX=
$HOME/ompi_for_gpu/ompi/bin/mpirun -x UCX_RNDV_SCHEME=get_zcopy -n 2 ./main

The original CUDA code is https://github.com/baidu-research/baidu-allreduce

I can confirm that I can reproduce the issue. In my case it is an MI250X system with ROCm 6.2 and UCX 1.16 (my default development platform at the moment), but the same error occurs. I will put it on my list of items to work on, but it might be towards the end of the week before I get to it.

Okay.

I think I know what the issue is, but I do not know yet whether it's something that we are doing wrong in the ROCm components of UCX or whether it's a bug in the ROCm runtime layer.

I do, however, have a quick workaround for your code (since a proper fix might take a while):

If you allocate the output buffer outside of the RingAllreduce test and pass it in as an argument to RingAllreduce, you avoid the hipMalloc() + hipFree() of the buffer on every iteration and do it just once per message size. For example, allocate the buffer right before the for(size_t iter = 0; iter < iters; iter++) loop, and in the loop body call hipMemset(output, 0, size * sizeof(float)) before calling RingAllreduce. With this modification, the test passes for me.

Let me emphasize, however, that your code is correct and it should work.
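
The workaround described above can be sketched roughly as follows. This is only an illustration under assumptions, not the actual HeCBench code: run_one_size, ring_allreduce_placeholder, the dev_* helpers, and the g_alloc_calls counter are hypothetical names, and the real RingAllreduce signature may differ. When built with hipcc the helpers call hipMalloc, hipMemset, and hipFree; otherwise a host fallback is used so the loop structure can be compiled without ROCm.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

#if defined(__HIPCC__)
#include <hip/hip_runtime.h>
#endif

// Hypothetical counter, only here to show how often the buffer is allocated.
static int g_alloc_calls = 0;

// Allocate n floats (device memory under hipcc, host memory otherwise).
static bool dev_alloc(float** ptr, std::size_t n) {
    ++g_alloc_calls;
#if defined(__HIPCC__)
    return hipMalloc(reinterpret_cast<void**>(ptr), n * sizeof(float)) == hipSuccess;
#else
    *ptr = static_cast<float*>(std::malloc(n * sizeof(float)));
    return *ptr != nullptr || n == 0;
#endif
}

// Zero n floats, mirroring hipMemset(output, 0, size * sizeof(float)).
static bool dev_zero(float* ptr, std::size_t n) {
    if (n == 0) return true;
#if defined(__HIPCC__)
    return hipMemset(ptr, 0, n * sizeof(float)) == hipSuccess;
#else
    std::memset(ptr, 0, n * sizeof(float));
    return true;
#endif
}

// Release a buffer obtained from dev_alloc.
static void dev_free(float* ptr) {
#if defined(__HIPCC__)
    (void)hipFree(ptr);
#else
    std::free(ptr);
#endif
}

// Placeholder for RingAllreduce (assumption: the real function would take
// the caller-owned output buffer as an argument and write the result there).
static void ring_allreduce_placeholder(float* output, std::size_t n) {
    (void)output;
    (void)n;
}

// Run all iterations for one message size with a single allocation.
bool run_one_size(std::size_t size, std::size_t iters) {
    float* output = nullptr;
    if (!dev_alloc(&output, size)) return false;   // once per message size
    bool ok = true;
    for (std::size_t iter = 0; iter < iters && ok; iter++) {
        ok = dev_zero(output, size);                // reset instead of reallocating
        if (ok) ring_allreduce_placeholder(output, size);
    }
    dev_free(output);                               // once per message size
    return ok;
}
```

With this structure, a call such as run_one_size(1024, 10) performs one allocation and ten resets, whereas the original per-iteration pattern performs ten hipMalloc() + hipFree() pairs for the same message size.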

I added another example
https://github.com/zjin-lcf/HeCBench/blob/master/src/pingpong-hip/main.cu
Does running this example cause similar errors?

Thank you for the workaround.