MLOPTPSU/FedTorch

Errors when running the code using Docker

Closed this issue · 5 comments

Hi,
I followed your README instructions to run your algorithm but I encountered multiple errors along the way:

  • I pulled the image you provided in this issue (#3) and ran the following command:
    python run_mpi.py -f -ft apfl -n 10 -d cifar10 -lg 0.1 -b 50 -c 20 -k 1.0 -fs local_step -l 10 -r 2 -pa 0.5 -fp -oc
    I keep getting warnings that suggest my RTX 2080 Ti card is not being recognized:
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Traceback (most recent call last):
  File "main.py", line 49, in <module>
    main(args)
  File "main.py", line 21, in main
    client.initialize()
  File "/workspace/fedtorch/fedtorch/nodes/nodes.py", line 44, in initialize
    init_config(self.args)
  File "/workspace/fedtorch/fedtorch/utils/init_config.py", line 36, in init_config
    torch.cuda.set_device(args.graph.device)
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 281, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /workspace/pytorch/torch/csrc/cuda/Module.cpp:59
THCudaCheck FAIL file=/workspace/pytorch/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
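For context, CUDA runtime error 101 means the process asked for a GPU ordinal that does not exist on the machine (e.g. `cuda:3` and up when only 3 GPUs are visible). A defensive sketch of the failing call, in plain Python with the device count passed in (the function name and fallback policy are my invention, not FedTorch code):

```python
def resolve_device(requested, num_gpus):
    """Return a device string for a requested CUDA ordinal.

    torch.cuda.set_device raises cuda runtime error 101
    ("invalid device ordinal") when requested >= num_gpus,
    so wrap around or fall back to CPU instead of crashing.
    """
    if num_gpus == 0:
        return "cpu"          # no visible GPUs at all
    return f"cuda:{requested % num_gpus}"

print(resolve_device(5, 3))   # -> cuda:2
print(resolve_device(0, 0))   # -> cpu
```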
/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:125: UserWarning:
GeForce RTX 2080 Ti with CUDA capability sm_75 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_35 sm_37 sm_52 sm_60 sm_61 sm_70 compute_70.
If you want to use the GeForce RTX 2080 Ti GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))

Despite these warnings the code does not crash, but not a single log line about the training procedure is printed.

[screenshot: console output]

In addition, when I run nvidia-smi I see the following GPU utilization:

[screenshot: nvidia-smi output]

The utilization percentage across all 3 GPUs remains zero.
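The compatibility warning above, rather than the card going unrecognized, is likely why nothing trains: the prebuilt wheel's kernel list stops at sm_70/compute_70, while the RTX 2080 Ti needs sm_75. A simplified, pure-Python model of that check (the arch list is copied from the warning; the function is a sketch of the idea, not PyTorch's actual logic):

```python
def is_compatible(capability, arch_list):
    # A GPU is covered by a build if its sm_XY code appears in the
    # build's arch list (simplified model of the warning above).
    code = f"sm_{capability[0]}{capability[1]}"
    return code in arch_list

# Arch list reported by the warning in this issue:
arch_list = ["sm_35", "sm_37", "sm_52", "sm_60", "sm_61", "sm_70", "compute_70"]

print(is_compatible((7, 5), arch_list))  # RTX 2080 Ti (sm_75) -> False
print(is_compatible((7, 0), arch_list))  # V100 (sm_70)        -> True
```

On a machine with torch installed, `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()` report the two inputs above for the local build and GPU.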

  • As a second option, I tried to build the image from the Dockerfile located in the docker folder.
    Yet again I encountered the following runtime error:
Traceback (most recent call last):
  File "main.py", line 4, in <module>
    import torch.distributed as dist
  File "/usr/local/lib/python3.8/dist-packages/torch/__init__.py", line 189, in <module>
    from torch._C import *
RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[51835,1],1]
  Exit code:    1
--------------------------------------------------------------------------

It looks like there is a problem with the PyTorch installation, so I tried something simple:

[screenshot: console output]

Upgrading NumPy didn't help either.
Can you please help and explain how to run your code?
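For reference, the hex codes in that error are NumPy C API version numbers. The mapping below is assembled from NumPy release notes (an assumption worth double-checking, not something stated in this repo); `0xe` vs `0xd` means the torch extension was compiled against NumPy 1.20+ while the interpreter that mpirun launches is finding an older NumPy:

```python
# Approximate mapping of NumPy C API versions to release lines
# (assumption based on NumPy release notes; verify for exact versions).
NPY_API_VERSIONS = {
    0xC: "1.14-1.15",
    0xD: "1.16-1.19",
    0xE: "1.20-1.21",
}

def explain(compiled_against, installed):
    need = NPY_API_VERSIONS.get(compiled_against, "unknown")
    have = NPY_API_VERSIONS.get(installed, "unknown")
    return (f"extension built against NumPy {need} "
            f"but NumPy {have} is installed")

print(explain(0xE, 0xD))
# -> extension built against NumPy 1.20-1.21 but NumPy 1.16-1.19 is installed
```

If that reading is right, the upgrade has to happen in the exact interpreter the MPI processes use (the python3.8 under /usr/local/lib in the traceback); upgrading in a different environment would explain why it appeared not to help.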

Thank you!

That is a weird error, since the Docker image works fine in the multiple setups I have tested. Can you give more context on the setup you are running the code in? Also, can you please run python and import torch from a directory other than /workspace?

As a side note, MPI can run on multiple cores of a CPU or on multiple GPUs, but I do not think it is capable of running multiple processes on a single GPU. Hence, with 3 GPUs the code can currently only run on three nodes, which means -n should be at most 3 (this would also explain the invalid device ordinal error above: with -n 10, ranks beyond the third request GPU ordinals that do not exist). Alternatively, you can run it on a single CPU with multiple cores.

What do you mean by setup? I tried to run it on a DGX server and on other hardware setups, and I tried your suggestion of importing torch outside /workspace; it didn't help. Can you please have a look at it? The main point of Docker is to solve exactly these environment and dependency issues. I wrote you a detailed issue with screenshots and commands; what else do you need?
Is there an easy way to reproduce the paper's results?

It seems that your GPU has some issues with the PyTorch installation; I cannot reproduce your error on my end. Whenever you create a Dockerfile, make sure the CUDA version and TORCH_CUDA_ARCH_LIST are compatible with your device.

Other than Docker, you can set up the environment yourself by installing MPI and then building PyTorch from source.
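For a source build targeting the 2080 Ti, the relevant knobs could look like the fragment below. This is a hypothetical sketch, not the repo's actual Dockerfile: the arch-list values and the /workspace/pytorch path (taken from the tracebacks above) are assumptions, while `TORCH_CUDA_ARCH_LIST` and `USE_MPI` are real PyTorch build variables:

```dockerfile
# Hypothetical fragment; adapt to the docker/Dockerfile in this repo.
# Include 7.5 so the build emits sm_75 kernels for the RTX 2080 Ti,
# and enable the MPI backend that run_mpi.py relies on.
ENV TORCH_CUDA_ARCH_LIST="6.0;6.1;7.0;7.5"
ENV USE_MPI=1
RUN cd /workspace/pytorch && python setup.py install
```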

On which GPUs do you recommend running your code? I also saw that you install a specific version of PyTorch; can't you install a newer one? By the way, PyTorch has built-in support for multiprocessing and distributed training.
Have you considered making your repo more accessible?

The code is based on the torch.distributed API; please read the code more carefully. The code is developed to run on GPUs and CPUs using the MPI, Gloo, and NCCL backends. For PyTorch to run with MPI, it needs to be built with MPI and CUDA support. I have made a huge effort to make this code open-source and publicly available, with a Docker container ready to use. If you do not appreciate the effort, it is better not to use it.