ncclIpcSocketSendFd failed in register_user_buffer_collective(alloc=true), --tp-comm-overlap
jingjie01ai opened this issue · 0 comments
- register_user_buffer_collective fails at ncclIpcSocketSendFd(...) when alloc=true.
[error msg]:
UDS: Sending data over socket /tmp/nccl-socket-3-deadcafebeef failed : Connection refused (111)
[code]:
https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp#L427
- The ncclIpcSocketSendFd(...) call in create_communicator_grouped2 succeeds.
[code]: https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp#L279
Questions:
Can I simply allocate the GPU buffer outside of register_user_buffer_collective and pass alloc=false? I tried this and it works.
What is the difference between allocating the buffer inside and outside of the function?
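
For context on the error above, here is a minimal Python sketch of one way "Connection refused (111)" can come out of a Unix domain socket send on Linux: the socket file exists on disk, but no socket is bound to it any more. It assumes the IPC socket is a Unix datagram socket, and the path `demo-socket` and the function name `send_to_stale_uds` are made up for illustration. It does not use NCCL or TransformerEngine, and it may or may not be what actually happens inside register_user_buffer_collective.

```python
import errno
import os
import socket
import tempfile


def send_to_stale_uds() -> int:
    """Send a datagram to a socket file whose owner has closed; return the errno (0 on success)."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical path for the demo; not the naming scheme NCCL uses.
        path = os.path.join(tmp, "demo-socket")

        # The "receiver" binds, which creates the socket file, then closes
        # without unlinking it, so the file stays behind with no listener.
        receiver = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        receiver.bind(path)
        receiver.close()

        # The "sender" now targets a path that exists but has no bound socket.
        sender = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        try:
            sender.sendto(b"payload", path)
        except OSError as exc:
            return exc.errno
        finally:
            sender.close()
    return 0


if __name__ == "__main__":
    code = send_to_stale_uds()
    print(code, errno.errorcode.get(code))
```

If the same thing applies here, it would suggest the receiving rank's socket was already closed (or had not been re-created) by the time the alloc=true path tried to send the file descriptor, but that is only a guess.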