PyTorch 1.4.0 with RTX 3080?
MiriamJo opened this issue · 2 comments
First problem solved; see the last comment for the most recent error.
I want to train this network on my RTX 3080, so I just tried to install PyTorch 1.4.0, since it's in the requirements.
I already compiled everything with a higher version of PyTorch, but some errors occur when I start the training process. Can somebody tell me if it's possible to work with newer versions of PyTorch, or do I HAVE to install PyTorch 1.4 with CUDA 10.1?
Since that PyTorch version isn't compatible with CUDA 10.2, I assume higher versions are also okay.
When trying to install apex with CUDA 10.1 or 10.2, I get:
raise ValueError(f"Unknown CUDA arch ({arch}) or GPU not supported")
ValueError: Unknown CUDA arch (compute) or GPU not supported
error: subprocess-exited-with-error
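For what it's worth, the RTX 3080 is an Ampere card with compute capability 8.6 (sm_86), and the CUDA 10.1/10.2 toolchains cannot target sm_86, which is most likely why apex's automatic arch detection gives up. A minimal sketch of working around the detection with a CUDA 11.x toolkit, assuming apex is being rebuilt from a local checkout:

```shell
# The RTX 3080 is compute capability 8.6 (sm_86); CUDA 10.1/10.2 only go
# up to sm_75, so apex's arch auto-detection fails on this card.
# With a CUDA 11.x toolkit installed, pin the arch list explicitly so
# the failing detection path is skipped during the extension build:
export TORCH_CUDA_ARCH_LIST="8.6"
# then rebuild apex from its checkout, e.g.:
#   pip install -v --no-cache-dir ./apex
echo "TORCH_CUDA_ARCH_LIST=$TORCH_CUDA_ARCH_LIST"
```

`TORCH_CUDA_ARCH_LIST` is read by PyTorch's C++/CUDA extension builder, so the same trick applies to other extensions in this repo (e.g. normalspeed).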
Somehow I also get an error when I try to install PyTorch 1.4.0 with either CUDA 10.1 or CUDA 10.2 in a conda env:
UnsatisfiableError: The following specifications were found to be incompatible with your system:
- feature:/linux-64::__glibc==2.31=0
- cudatoolkit=11 -> __glibc[version='>=2.17,<3.0.a0']
It works if I use CUDA 11.3, but the repo explicitly says to use CUDA 10.1 or 10.2, so I guess the error during training comes from the wrong CUDA version?
When I run it with the newer PyTorch version, I get:
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 15685) of binary:
...
File "/home/q/anaconda3/envs/ffb6d/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_custom.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-05-13_03:18:36
host : DESKTOP-NE1AHQS.
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 15685)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
I fixed the "GPU not available" error by removing all CUDA and NVIDIA driver installs and installing just the 11.7 wsl-ubuntu CUDA toolkit.
Moreover, I created a conda environment and installed mamba in it; that way I wasn't getting compiler errors when installing the requirements. I focused on installing OpenCV 3 instead of 4 and compiled PyTorch with CUDA 11.3 (which has a driver available for WSL). Hope it might help someone with the same problem.
After pointing my nvcc to 11.3, I managed to compile apex and normalspeed successfully.
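Pointing nvcc at a specific toolkit is usually just a matter of environment variables; a sketch, assuming the toolkit was installed under the default /usr/local/cuda-11.3 prefix:

```shell
# Select the CUDA 11.3 toolkit for builds (default install prefix assumed;
# adjust CUDA_HOME if the toolkit lives elsewhere).
export CUDA_HOME=/usr/local/cuda-11.3
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-}"
# 'nvcc --version' should now report release 11.3
```

Putting these lines in ~/.bashrc makes the selection stick across shells, which matters because the extension builds pick up whichever nvcc is first on PATH.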
Now what's left is the torch.distributed.elastic.multiprocessing.errors.ChildFailedError, but a bit different:
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 1338) of binary: /home/q/anaconda3/envs/ffb6d/bin/python3
Traceback (most recent call last):
File "/home/q/anaconda3/envs/ffb6d/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/q/anaconda3/envs/ffb6d/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/q/anaconda3/envs/ffb6d/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/q/anaconda3/envs/ffb6d/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/q/anaconda3/envs/ffb6d/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/q/anaconda3/envs/ffb6d/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/q/anaconda3/envs/ffb6d/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/q/anaconda3/envs/ffb6d/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
train_lm.py FAILED
-----------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-05-16_23:37:52
host : DESKTOP-NE1AHQS.
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 1338)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 1338
=====================================================
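A note on reading this: exitcode -9 means the child process received SIGKILL rather than raising a Python exception. On WSL that is commonly the Linux OOM killer, since WSL2 caps the VM at a fraction of host RAM by default, though (as the update below shows) other root causes can surface the same way. If memory is the culprit, reducing the batch size or dataloader workers helps; alternatively, the WSL2 memory cap can be raised from the Windows side. An illustrative fragment (the values here are examples, not recommendations):

```ini
; %UserProfile%\.wslconfig on the Windows side; applies after `wsl --shutdown`
[wsl2]
memory=24GB
swap=8GB
```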
Update:
It was just my mistake: I totally forgot to install the matching cuDNN library, since I had removed everything before. Now it works and the training process starts.