Environment conflicts with GPU
Hi, thanks a lot for your attention to these issues! I wanted to ask for your thoughts on the following problem I run into when training Stage 1:
24-01-18 01:06:41.203 - INFO: [Phase 1] Training noise model!
24-01-18 01:07:04.744 - INFO: MRI dataset [hardi] is created.
24-01-18 01:07:23.001 - INFO: MRI dataset [hardi] is created.
24-01-18 01:07:23.001 - INFO: Initial Dataset Finished
/home/anar/mambaforge-pypy3/envs/ddm2_image/lib/python3.8/site-packages/torch/cuda/__init__.py:104: UserWarning:
NVIDIA RTX 6000 Ada Generation with CUDA capability sm_89 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA RTX 6000 Ada Generation GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
24-01-18 01:07:23.542 - INFO: Noise Model is created.
24-01-18 01:07:23.542 - INFO: Initial Model Finished
1.8.0 10.2  (torch.__version__ and torch.version.cuda)
export CUDA_VISIBLE_DEVICES=2
Loaded data of size: (118, 118, 25, 56)
Loaded data of size: (118, 118, 25, 56)
dropout 0.0 encoder dropout 0.0
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
[the same two lines repeat 32 times in total]
Traceback (most recent call last):
File "train_noise_model.py", line 72, in <module>
trainer.optimize_parameters()
File "/home/anar/DDM2/model/model_stage1.py", line 62, in optimize_parameters
outputs = self.netG(self.data)
File "/home/anar/mambaforge-pypy3/envs/ddm2_image/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/anar/DDM2/model/mri_modules/noise_model.py", line 44, in forward
return self.p_losses(x, *args, **kwargs)
File "/home/anar/DDM2/model/mri_modules/noise_model.py", line 36, in p_losses
x_recon = self.denoise_fn(x_in['condition'])
File "/home/anar/mambaforge-pypy3/envs/ddm2_image/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/anar/DDM2/model/mri_modules/unet.py", line 286, in forward
x = layer(x)
File "/home/anar/mambaforge-pypy3/envs/ddm2_image/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/anar/mambaforge-pypy3/envs/ddm2_image/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/anar/mambaforge-pypy3/envs/ddm2_image/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA error: no kernel image is available for execution on the device
Previously, when I was trying to denoise the HARDI150 volumes, I didn't pin any PyTorch version and used Python >= 3.10. After noticing the pinned versions in your original environment.yaml, I switched to the exact torch, torchvision, and Python versions it specifies, and that is when the error above started. Do you think it is better not to pin PyTorch at all, or should the versions match exactly?
I ask because I suspected the validation-loader failure from my previous issue was also caused by version mismatches with the environment file, but after running into the problem above I am still unsure about that too.
@tiangexiang Any ideas on this?
Sorry for the late response! The error you reported indicates a mismatch between the PyTorch version and the CUDA version. And you are right that the validation loader failure is probably due to a version mismatch as well. So I do recommend duplicating the exact environment specified in environment.yaml, since it is guaranteed to work (be careful with the CUDA version, though: it has to match your own hardware!).
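If it helps, here is a minimal sketch of how to check whether an installed build can actually run on your card (sm_89 is the compute capability of the RTX 6000 Ada from your warning):

```python
import torch

# Wheel version and the CUDA toolkit it was built against
print(torch.__version__, torch.version.cuda)

# Compute capabilities this build ships kernels for; the RTX 6000 Ada
# needs sm_89 (or a compatible 8.x PTX) to appear in this list
print(torch.cuda.get_arch_list())

# Name and compute capability of the visible device
print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))
```

If get_arch_list() does not cover your device's capability, you get exactly the "no kernel image is available" error you pasted above.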
@tiangexiang Thanks for the reply. I checked very carefully, and to match my hardware I set cudatoolkit=11.3 and the corresponding PyTorch versions as follows:
name: ddm2_experiment
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1
  - _openmp_mutex=4.5
  - _pytorch_select=0.1
  - blas=1.0
  - ca-certificates=2022.3.29
  - certifi=2021.10.8
  - cudatoolkit=11.3
  - freetype=2.11.0
  - giflib=5.2.1
  - intel-openmp=2021.4.0
  - jpeg=9d
  - lcms2=2.12
  - ld_impl_linux-64=2.35.1
  - libffi=3.3
  - libgcc-ng=9.3.0
  - libgomp=9.3.0
  - libpng=1.6.37
  - libstdcxx-ng=9.3.0
  - libtiff=4.2.0
  - libuv=1.40.0
  - libwebp=1.2.2
  - libwebp-base=1.2.2
  - lz4-c=1.9.3
  - mkl=2021.4.0
  - mkl-service=2.4.0
  - mkl_fft=1.3.1
  - mkl_random=1.2.2
  - ncurses=6.3
  - ninja=1.10.2
  - openssl=1.1.1n
  - pip=21.2.4
  - python=3.8.13
  - readline=8.1.2
  - setuptools=58.0.4
  - six=1.16.0
  - sqlite=3.38.2
  - tk=8.6.11
  - typing_extensions=4.1.1
  - wheel=0.37.1
  - xz=5.2.5
  - zlib=1.2.11
  - zstd=1.4.9
  - pip:
      - beautifulsoup4==4.11.1
      - charset-normalizer==2.0.12
      - cycler==0.11.0
      - dipy==1.5.0
      - filelock==3.6.0
      - fonttools==4.31.2
      - gdown==4.4.0
      - h5py==3.6.0
      - idna==3.3
      - imageio==2.16.1
      - joblib==1.1.0
      - kiwisolver==1.4.2
      - matplotlib==3.5.1
      - networkx==2.7.1
      - nibabel==3.2.2
      - numpy==1.22.3
      - opencv-python==4.5.4.58
      - packaging==21.3
      - pandas==1.4.1
      - pillow==9.1.0
      - pydicom==2.3.0
      - pyparsing==3.0.7
      - pysocks==1.7.1
      - python-dateutil==2.8.2
      - pytz==2022.1
      - pywavelets==1.3.0
      - pyyaml==6.0
      - requests==2.27.1
      - scikit-image==0.19.2
      - scikit-learn==1.0.2
      - scipy==1.8.0
      - seaborn==0.11.2
      - soupsieve==2.3.2.post1
      - statannot==0.2.3
      - threadpoolctl==3.1.0
      - tifffile==2022.3.25
      - timm==0.4.12
      - torch==1.8.0
      - torchvision==0.9.0
      - tqdm==4.63.1
      - urllib3==1.26.9
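(One thing I am not 100% sure about: if I understand correctly, the plain torch==1.8.0 wheel from PyPI is built against CUDA 10.2, and the conda cudatoolkit=11.3 package does not change what a pip-installed torch was compiled for, so a CUDA-11-matched 1.8.0 install would need the +cu111 wheels instead, e.g.:

```
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
```

though native sm_89 support only arrived with CUDA 11.8, so an Ada card like mine may ultimately need a newer PyTorch build.)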
Even with the versions matched, I still had problems in the validation part of the training:
Validation
Traceback (most recent call last):
File "train_noise_model.py", line 92, in <module>
for _, val_data in enumerate(val_loader):
File "/home/anar/mambaforge-pypy3/envs/ddm2_experiment/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
data = self._next_data()
File "/home/anar/mambaforge-pypy3/envs/ddm2_experiment/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
return self._process_data(data)
File "/home/anar/mambaforge-pypy3/envs/ddm2_experiment/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
data.reraise()
File "/home/anar/mambaforge-pypy3/envs/ddm2_experiment/lib/python3.8/site-packages/torch/_utils.py", line 461, in reraise
raise exception
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/anar/mambaforge-pypy3/envs/ddm2_experiment/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/home/anar/mambaforge-pypy3/envs/ddm2_experiment/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/anar/mambaforge-pypy3/envs/ddm2_experiment/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/anar/DDM2/data/mri_dataset.py", line 130, in __getitem__
raw_input = raw_input[:,:,0]
IndexError: index 0 is out of bounds for axis 2 with size 0
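For what it's worth, the IndexError itself is easy to reproduce in isolation: it fires whenever the selected slice range is already empty along axis 2 before the [:, :, 0] indexing at line 130 of data/mri_dataset.py, so my guess (only a guess from the shapes above) is that the validation split ends up selecting an empty slice/volume range:

```python
import numpy as np

# Mimic the indexing in data/mri_dataset.py when the selected
# slice range is empty along axis 2
raw_input = np.zeros((118, 118, 0, 3))  # axis 2 has size 0
raw_input = raw_input[:, :, 0]  # IndexError: index 0 is out of bounds for axis 2 with size 0
```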
Even trying the latest versions of torch & torchvision did not help at all 🙁