NVIDIA/TransformerEngine

ncclIpcSocketSendFd fails in register_user_buffer_collective(alloc=true) with --tp-comm-overlap

jingjie01ai opened this issue · 0 comments

  1. register_user_buffer_collective fails at ncclIpcSocketSendFd(...) when alloc=true.
    [error msg]:
    UDS: Sending data over socket /tmp/nccl-socket-3-deadcafebeef failed : Connection refused (111)
    [code]:
    https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp#L427

  2. The ncclIpcSocketSendFd(...) call in create_communicator_grouped2 runs successfully.
    [code]: https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp#L279
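
For context on the error code: errno 111 is ECONNREFUSED on Linux. The sketch below is not TransformerEngine or NCCL code. It only reproduces the same errno with a plain Unix domain datagram socket, assuming the IPC socket is a datagram socket addressed by a filesystem path (which the /tmp/nccl-socket-... name suggests). The path and the function name send_to_stale_socket are hypothetical. On Linux, sending to a socket path whose receiver has already closed (or is no longer bound) is refused, which would be consistent with the peer's receive socket not being open at the time of the send.

```python
import errno
import os
import socket
import tempfile


def send_to_stale_socket() -> int:
    """Return the errno raised when sending to a Unix datagram socket
    path that exists on disk but has no socket bound to it (Linux)."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Hypothetical path, standing in for /tmp/nccl-socket-<rank>-<hash>.
        path = os.path.join(tmpdir, "demo-socket")

        # Bind a receiver, then close it. The socket file stays on disk,
        # but no socket is listening on it any more.
        receiver = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        receiver.bind(path)
        receiver.close()

        sender = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        try:
            sender.sendto(b"x", path)
        except OSError as exc:
            return exc.errno
        finally:
            sender.close()
    return 0


if __name__ == "__main__":
    code = send_to_stale_socket()
    print(code, errno.errorcode.get(code, "no error"))
```

If this matches what happens inside register_user_buffer_collective, the thing to check would be whether every peer rank still has its receiving socket open when the sender calls ncclIpcSocketSendFd. This is a guess based on the errno, not a confirmed diagnosis.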

Questions:
Can I just allocate the GPU buffer outside of register_user_buffer_collective and call it with alloc=false? I tried this and it succeeded.
What is the difference between allocating the buffer inside the function and allocating it outside?