RDMA_FORK_SAFE
visualxu commented
Hi, I wrote a communication framework for our company's self-developed GPGPU, using the IB (ibverbs) interface of Gloo. When using torch.utils.data.DataLoader, which forks many worker processes, I got the following error:
gloo/transport/ibverbs/pair.cc:438] wc->status == IBV_WC_SUCCESS. 5 vs 0. Send for slot 0: Work Request Flushed Error
After debugging, I found that this problem is caused by libibverbs' incomplete support for fork().
https://www.rdmamojo.com/2012/05/24/ibv_fork_init/
I think we should prompt users of the InfiniBand interface to set the environment variable RDMA_FORK_SAFE or IBV_FORK_SAFE, or have Gloo call ibv_fork_init() itself when initializing IB, as NCCL does (gloo/ibverbs/device.cc).
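
In the meantime, a possible user-side workaround is to set the variables from Python before anything touches the IB device. This is only a minimal sketch: `enable_ibverbs_fork_safety` is a hypothetical helper name, not part of Gloo or PyTorch, and it assumes libibverbs reads these variables once, when it is first initialized in the process, so it has to run before the Gloo context and the DataLoader workers are created.

```python
import os

# Environment variables named in this issue; libibverbs is assumed to check
# them when it is first initialized in the process.
FORK_SAFE_VARS = ("RDMA_FORK_SAFE", "IBV_FORK_SAFE")


def enable_ibverbs_fork_safety(env=os.environ):
    """Hypothetical helper: request fork-safe behaviour from libibverbs.

    Must run before any ibverbs device is opened (for example before the
    Gloo IB transport is created) and before DataLoader workers are forked.
    Values already set by the user are left untouched.
    """
    for name in FORK_SAFE_VARS:
        env.setdefault(name, "1")
    return {name: env[name] for name in FORK_SAFE_VARS}


# Call this at the very top of the training script.
enable_ibverbs_fork_safety()
```

Exporting the same variables in the shell that launches the job should have the same effect. Forked workers inherit the environment, but the setting only matters in the process that registers memory with the IB device.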