PyTorch 1.4.0 with RTX 3080?
MiriamJo opened this issue · 2 comments
First problem solved; see the last comment for the most recent error.
I want to train this network on my RTX 3080, so I just tried to install PyTorch 1.4.0, since it's in the requirements.
I already compiled everything with a higher version of PyTorch, but some errors occur when I start the training process. Can somebody tell me if it's possible to work with newer versions of PyTorch, or do I HAVE to install PyTorch 1.4 with CUDA 10.1?
Since that PyTorch version isn't compatible with CUDA 10.2, I assume higher versions are also okay.
When trying to install apex with CUDA 10.1 or 10.2, I get:
raise ValueError(f"Unknown CUDA arch ({arch}) or GPU not supported")
ValueError: Unknown CUDA arch (compute) or GPU not supported
error: subprocess-exited-with-error
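For what it's worth, the RTX 3080 is an Ampere card with compute capability 8.6 (sm_86), and the CUDA 10.1/10.2 toolchains cannot target sm_86, which is most likely why apex's automatic arch detection gives up. A minimal sketch of working around the detection with a CUDA 11.x toolkit, assuming apex is being rebuilt from a local checkout:

```shell
# The RTX 3080 is compute capability 8.6 (sm_86); CUDA 10.1/10.2 only go
# up to sm_75, so apex's arch auto-detection fails on this card.
# With a CUDA 11.x toolkit installed, pin the arch list explicitly so
# the failing detection path is skipped during the extension build:
export TORCH_CUDA_ARCH_LIST="8.6"
# then rebuild apex from its checkout, e.g.:
#   pip install -v --no-cache-dir ./apex
echo "TORCH_CUDA_ARCH_LIST=$TORCH_CUDA_ARCH_LIST"
```

`TORCH_CUDA_ARCH_LIST` is read by PyTorch's C++/CUDA extension builder, so the same trick applies to other extensions in this repo (e.g. normalspeed).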
Somehow I also get an error when I try to install PyTorch 1.4.0 with either CUDA 10.1 or CUDA 10.2 in a conda env:
UnsatisfiableError: The following specifications were found to be incompatible with your system:
- feature:/linux-64::__glibc==2.31=0
- cudatoolkit=11 -> __glibc[version='>=2.17,<3.0.a0']
It works if I use CUDA 11.3, but the repo explicitly says to use CUDA 10.1 or 10.2, so I guess the error during training comes from the wrong CUDA version?
When I run it with the newer PyTorch version, I get:
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 15685) of binary:
...
File "/home/q/anaconda3/envs/ffb6d/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_custom.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-05-13_03:18:36
host : DESKTOP-NE1AHQS.
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 15685)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
I fixed the "GPU not available" error by removing all CUDA and NVIDIA driver installs and installing just the 11.7 wsl-ubuntu CUDA toolkit.
Moreover, I created a conda environment and installed mamba in it; that way I wasn't getting compiler errors when installing the requirements. I focused on installing OpenCV 3 instead of 4 and compiled PyTorch with CUDA 11.3 (which has a driver available for WSL). Hope it might help someone with the same problem.
After pointing my nvcc to 11.3, I managed to compile apex and normalspeed successfully.
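Pointing nvcc at a specific toolkit is usually just a matter of environment variables; a sketch, assuming the toolkit was installed under the default /usr/local/cuda-11.3 prefix:

```shell
# Select the CUDA 11.3 toolkit for builds (default install prefix assumed;
# adjust CUDA_HOME if the toolkit lives elsewhere).
export CUDA_HOME=/usr/local/cuda-11.3
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-}"
# 'nvcc --version' should now report release 11.3
```

Putting these lines in ~/.bashrc makes the selection stick across shells, which matters because the extension builds pick up whichever nvcc is first on PATH.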
Now what's left is the torch.distributed.elastic.multiprocessing.errors.ChildFailedError, but a bit different:
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 1338) of binary: /home/q/anaconda3/envs/ffb6d/bin/python3
Traceback (most recent call last):
File "/home/q/anaconda3/envs/ffb6d/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/q/anaconda3/envs/ffb6d/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/q/anaconda3/envs/ffb6d/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/q/anaconda3/envs/ffb6d/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/q/anaconda3/envs/ffb6d/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/q/anaconda3/envs/ffb6d/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/q/anaconda3/envs/ffb6d/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/q/anaconda3/envs/ffb6d/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
train_lm.py FAILED
-----------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-05-16_23:37:52
host : DESKTOP-NE1AHQS.
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 1338)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 1338
=====================================================
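A note on reading this: exitcode -9 means the child process received SIGKILL rather than raising a Python exception. On WSL that is commonly the Linux OOM killer, since WSL2 caps the VM at a fraction of host RAM by default, though (as the update below shows) other root causes can surface the same way. If memory is the culprit, reducing the batch size or dataloader workers helps; alternatively, the WSL2 memory cap can be raised from the Windows side. An illustrative fragment (the values here are examples, not recommendations):

```ini
; %UserProfile%\.wslconfig on the Windows side; applies after `wsl --shutdown`
[wsl2]
memory=24GB
swap=8GB
```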
Update:
It was just my mistake: I totally forgot to install the matching cuDNN library, since I had removed everything before. Now it works and the training process starts.