Training on A6000 with CUDA 11.7
DarylWM opened this issue · 0 comments
DarylWM commented
Hello. Thank you for sharing this code; I think it's really interesting work and I'm keen to try it on my dataset. To get started, I'd like to train the example on an A6000 using CUDA 11.7. I tried the versions specified in environment.yml, but needed to bump the CUDA version to get A6000 support. Now I get this error:
```
experiments/experiment_1/
backing up... done.
train.py:1301: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.
  return imageio.imread(f, ignoregamma=True) if f[-4:] == ".png" else imageio.imread(f)
Loaded llff (86, 384, 512, 3) (120, 3, 5) [384. 512. 256.60952759] data/example_sequence/
DEFINING BOUNDS
NEAR FAR 0.0021997066447511314 1.0024441480636597
Found ckpts []
start: 0 args.N_iters: 200000
C:\Users\dwil6816\Anaconda3\envs\nrnerf\lib\site-packages\torch\functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\TensorShape.cpp:3191.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
get rays
done, concats
(86, 384, 512, 4, 3)
TRAIN views are [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85]
TEST views are []
VAL views are []
Begin
  0%|          | 0/200000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 2016, in <module>
    main_function(args)
  File "train.py", line 1566, in main_function
    losses = parallel_training(
  File "C:\Users\dwil6816\Anaconda3\envs\nrnerf\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\dwil6816\Anaconda3\envs\nrnerf\lib\site-packages\torch\nn\parallel\data_parallel.py", line 169, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "C:\Users\dwil6816\Anaconda3\envs\nrnerf\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "train.py", line 188, in forward
    imageid_to_timestepid[batch_pixel_indices[:, 0]], :
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
```
Here are the Torch versions I'm using:
```
torch        1.13.1+cu117
torchaudio   0.13.1+cu117
torchvision  0.14.1+cu117
```
I'm new to PyTorch, so I'm wondering whether there's a global fix, or whether I need to go through and check where the tensors are allocated.
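For what it's worth, here is a minimal sketch of what the error means and one way to fix it. The tensor names mirror those in the traceback, but the shapes and surrounding logic are stand-ins, not the actual train.py code: the indexed lookup table lives on the CPU while the indices live on the GPU, and moving the table onto the indices' device resolves the mismatch.

```python
import torch

# Stand-in for the lookup table from train.py; by default it lives on the CPU.
imageid_to_timestepid = torch.arange(86)

# Stand-in for the batch indices, which end up on the GPU inside the model's
# forward pass (falls back to CPU here if no GPU is available).
device = "cuda" if torch.cuda.is_available() else "cpu"
batch_pixel_indices = torch.zeros(4, 3, dtype=torch.long, device=device)

# Indexing a CPU tensor with CUDA indices raises:
#   RuntimeError: indices should be either on cpu or on the same device
#   as the indexed tensor (cpu)
# Moving the lookup table onto the indices' device avoids the mismatch:
imageid_to_timestepid = imageid_to_timestepid.to(batch_pixel_indices.device)
timesteps = imageid_to_timestepid[batch_pixel_indices[:, 0]]
```

In newer PyTorch versions this device check became stricter, which may be why the error only appears after upgrading; a one-line `.to(...)` at the point where `imageid_to_timestepid` is created (rather than a global change) would likely be enough.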