HRNet/DEKR

installation issue: ncclSystemError: System call (socket, malloc, munmap, etc) failed.

looninho opened this issue · 3 comments

Hi,

thank you for sharing your work.

I'm trying to test DEKR but I'm running into an NCCL issue. When I run train.py, it fails with this error:

RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8

ncclSystemError: System call (socket, malloc, munmap, etc) failed.

Could you give me some tips to overcome this?

Environment:

  • CUDA:
    • GPU:
      • NVIDIA GTX 1080 Ti PCIE-11GB
      • NVIDIA GTX 1080 Ti PCIE-11GB
      • NVIDIA GTX Titan PCIE-12GB
      • NVIDIA GTX Titan PCIE-12GB
    • version: 10.2
  • System:
    • OS: Ubuntu 18.04
    • architecture: 64bit
    • processor: x86_64
    • python: 3.6.9

I ran into the same problem. Do you know how to solve it now?
Thanks a lot if you can let me know!

Hi @longpeace,

I solved the ncclSystemError issue by adding the --ipc=host flag to my docker command.
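
For anyone else hitting this: as far as I understand, NCCL uses shared memory for communication between GPUs on the same machine, and Docker gives a container only 64 MiB of /dev/shm by default, which is probably why sharing the host IPC namespace helps. Here is a minimal sketch for checking this from inside the container before launching training. The helper names (shm_total_bytes, shm_looks_limited) are made up for illustration, and treating a 64 MiB mount as "too small" is my assumption, not something I verified against NCCL itself.

```python
import os
import shutil

# Docker's default /dev/shm size when neither --shm-size nor --ipc=host is given.
DOCKER_DEFAULT_SHM_BYTES = 64 * 1024 * 1024


def shm_total_bytes(path="/dev/shm"):
    """Return the total size of the shared-memory mount in bytes, or None if it is absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total


def shm_looks_limited(total_bytes, threshold=DOCKER_DEFAULT_SHM_BYTES):
    """Return True when the mount is no larger than the threshold (an assumed cutoff)."""
    return total_bytes is not None and total_bytes <= threshold


if __name__ == "__main__":
    total = shm_total_bytes()
    if total is None:
        print("/dev/shm not found")
    else:
        print(f"/dev/shm total: {total / 2**20:.0f} MiB")
        if shm_looks_limited(total):
            print("Shared memory looks limited; try --ipc=host or a larger --shm-size.")
```

If the script reports about 64 MiB, restarting the container with --ipc=host (or with a larger --shm-size) is worth trying first.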

[SOLVED]