installation issue: ncclSystemError: System call (socket, malloc, munmap, etc) failed.
looninho opened this issue · 3 comments
looninho commented
Hi,
Thank you for sharing your work.
I'm trying to test DEKR but I'm running into an NCCL issue. When I run train.py, it fails with this error:
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
Could you give me some tips to overcome this?
Environment:
CUDA:
- GPU:
  - NVIDIA GTX 1080 Ti PCIE-11GB
  - NVIDIA GTX 1080 Ti PCIE-11GB
  - NVIDIA GTX Titan PCIE-12GB
  - NVIDIA GTX Titan PCIE-12GB
- version: 10.2
System:
- OS: Ubuntu 18.04
- architecture: 64bit
- processor: x86_64
- python: 3.6.9
longpeace commented
I ran into the same problem. Have you found a way to solve it?
Thanks a lot if you can let me know!
looninho commented
Hi @longpeace,
I solved the ncclSystemError issue by adding the --ipc=host flag to the docker run command.
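For anyone landing here later, a rough sketch of what such a command could look like. The image name is hypothetical (not from this repo), and `--gpus all` assumes the NVIDIA Container Toolkit is installed. The likely reason the flag helps: Docker's default /dev/shm is small (64 MB), while `--ipc=host` lets the container use the host's shared memory, which NCCL and PyTorch data loaders rely on.

```shell
# Hypothetical image name; replace it with the image you built for DEKR.
IMAGE="my-dekr-image:latest"

# --gpus all  : expose the host GPUs (assumes the NVIDIA Container Toolkit)
# --ipc=host  : share the host IPC namespace, including /dev/shm
CMD="docker run --gpus all --ipc=host -it --rm $IMAGE bash"

# Print the command for review; run it with: eval "$CMD"
echo "$CMD"
```

If sharing the host IPC namespace is not acceptable, `--shm-size` (for example `--shm-size=8g`) is the usual alternative for enlarging /dev/shm, though I have not confirmed it fixes this particular error.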
looninho commented
[SOLVED]