ashawkey/RAD-NeRF

Training error on custom dataset a few epochs in, after preprocessing

gloomiebloomie opened this issue · 1 comment

command run: python main.py data/ian/ --workspace trial_ian/ -O --iters 200000
I'm trying to train my own model. The path in the error exists: 95.png is in my Drive, and the Drive is mounted. It's odd that earlier epochs worked fine with the same data. It seems like Colab may have just messed up, even though the runtime was still showing as connected.
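For anyone hitting the same assertion: `cv2.imread` does not raise on a missing or unreadable file, it silently returns `None`, which only surfaces later as the `cvtColor` `!_src.empty()` error below. A quick pre-flight sweep of the dataset can catch this before a multi-hour run. This is a hypothetical stdlib-only helper (no OpenCV needed; `find_unreadable` and the `torso_imgs/` layout are assumptions based on the paths in the log):

```python
import os
from pathlib import Path

def find_unreadable(img_dir, exts=(".png", ".jpg", ".jpeg")):
    """Return image paths that are missing or zero-length.

    These are exactly the files cv2.imread would return None for,
    which later crashes cv2.cvtColor with '!_src.empty()'.
    """
    bad = []
    for p in sorted(Path(img_dir).iterdir()):
        if p.suffix.lower() not in exts:
            continue
        try:
            if p.stat().st_size == 0:
                bad.append(str(p))
        except OSError:
            # A stale Google Drive mount raises OSError (Errno 107)
            # on any access; treat those files as unreadable too.
            bad.append(str(p))
    return bad

if __name__ == "__main__":
    # Example: sweep the torso images before training.
    for path in find_unreadable("data/ian/torso_imgs"):
        print("unreadable:", path)
```

Running this right before `main.py` would have flagged `95.png` (or shown that the whole mount had gone stale) instead of dying at 76% of epoch 12.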

Error: Namespace(path='data/ian/', O=True, test=False, test_train=False, data_range=[0, -1], workspace='trial_ian/', seed=0, iters=200000, lr=0.005, lr_net=0.0005, ckpt='latest', num_rays=65536, cuda_ray=True, max_steps=16, num_steps=16, upsample_steps=0, update_extra_interval=16, max_ray_batch=4096, fp16=True, lambda_amb=0.1, bg_img='', fbg=False, exp_eye=True, fix_eye=-1, smooth_eye=False, torso_shrink=0.8, color_space='srgb', preload=0, bound=1, scale=4, offset=[0, 0, 0], dt_gamma=0.00390625, min_near=0.05, density_thresh=10, density_thresh_torso=0.01, patch_size=1, finetune_lips=False, smooth_lips=False, torso=False, head_ckpt='', gui=False, W=450, H=450, radius=3.35, fovy=21.24, max_spp=1, att=2, aud='', emb=False, ind_dim=4, ind_num=10000, ind_dim_torso=8, amb_dim=2, part=False, part2=False, train_camera=False, smooth_path=False, smooth_path_window=7, asr=False, asr_wav='', asr_play=False, asr_model='cpierse/wav2vec2-large-xlsr-53-esperanto', asr_save_feats=False, fps=50, l=10, m=50, r=10)
[INFO] load 2030 train frames.
[INFO] load aud_features: torch.Size([2229, 44, 16])
Loading train data: 100% 2030/2030 [00:04<00:00, 503.51it/s]
[INFO] eye_area: 0.14190673828125 - 0.35076141357421875
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torchvision/models/utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=AlexNet_Weights.IMAGENET1K_V1. You can also use weights=AlexNet_Weights.DEFAULT to get the most up-to-date weights.
warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/alexnet-owt-7be5be79.pth" to /root/.cache/torch/hub/checkpoints/alexnet-owt-7be5be79.pth
100% 233M/233M [00:01<00:00, 138MB/s]
Loading model from: /usr/local/lib/python3.10/dist-packages/lpips/weights/v0.1/alex.pth
[INFO] Trainer: ngp | 2023-06-26_17-51-59 | cuda | fp16 | trial_ian/
[INFO] #parameters: 3024277
[INFO] Loading latest checkpoint ...
[WARN] No checkpoint found, model randomly initialized.
[INFO] load 100 val frames.
[INFO] load aud_features: torch.Size([2229, 44, 16])
Loading val data: 100% 100/100 [00:00<00:00, 485.16it/s]
[INFO] eye_area: 0.20160675048828125 - 0.33855438232421875
[INFO] max_epoch = 99
==> Start Training Epoch 1, lr=0.000500 ...
loss=0.0009 (0.0020), lr=0.000488: 100% 2030/2030 [04:27<00:00, 7.59it/s]
==> Finished Epoch 1.
==> Start Training Epoch 2, lr=0.000488 ...
loss=0.0004 (0.0010), lr=0.000477: 100% 2030/2030 [03:41<00:00, 9.17it/s]
==> Finished Epoch 2.
++> Evaluate at epoch 2 ...
loss=0.0005 (0.0006): 100% 100/100 [00:16<00:00, 5.96it/s]
PSNR = 32.244503
LPIPS (alex) = 0.073856
++> Evaluate epoch 2 Finished.
==> Start Training Epoch 3, lr=0.000477 ...
loss=0.0013 (0.0009), lr=0.000466: 100% 2030/2030 [03:48<00:00, 8.89it/s]
==> Finished Epoch 3.
==> Start Training Epoch 4, lr=0.000466 ...
loss=0.0004 (0.0009), lr=0.000455: 100% 2030/2030 [03:45<00:00, 9.02it/s]
==> Finished Epoch 4.
++> Evaluate at epoch 4 ...
loss=0.0004 (0.0005): 100% 100/100 [00:16<00:00, 6.16it/s]
PSNR = 32.979616
LPIPS (alex) = 0.063454
++> Evaluate epoch 4 Finished.
==> Start Training Epoch 5, lr=0.000455 ...
loss=0.0017 (0.0009), lr=0.000445: 100% 2030/2030 [03:41<00:00, 9.18it/s]
==> Finished Epoch 5.
==> Start Training Epoch 6, lr=0.000445 ...
loss=0.0008 (0.0008), lr=0.000435: 100% 2030/2030 [03:36<00:00, 9.38it/s]
==> Finished Epoch 6.
++> Evaluate at epoch 6 ...
loss=0.0004 (0.0006): 100% 100/100 [00:15<00:00, 6.64it/s]
PSNR = 32.863394
LPIPS (alex) = 0.060131
++> Evaluate epoch 6 Finished.
==> Start Training Epoch 7, lr=0.000435 ...
loss=0.0008 (0.0008), lr=0.000425: 100% 2030/2030 [03:38<00:00, 9.30it/s]
==> Finished Epoch 7.
==> Start Training Epoch 8, lr=0.000425 ...
loss=0.0008 (0.0008), lr=0.000415: 100% 2030/2030 [03:38<00:00, 9.28it/s]
==> Finished Epoch 8.
++> Evaluate at epoch 8 ...
loss=0.0005 (0.0006): 100% 100/100 [00:15<00:00, 6.56it/s]
PSNR = 32.951550
LPIPS (alex) = 0.056809
++> Evaluate epoch 8 Finished.
==> Start Training Epoch 9, lr=0.000415 ...
loss=0.0004 (0.0008), lr=0.000405: 100% 2030/2030 [03:39<00:00, 9.26it/s]
==> Finished Epoch 9.
==> Start Training Epoch 10, lr=0.000405 ...
loss=0.0005 (0.0008), lr=0.000396: 100% 2030/2030 [03:40<00:00, 9.22it/s]
==> Finished Epoch 10.
++> Evaluate at epoch 10 ...
loss=0.0005 (0.0006): 100% 100/100 [00:14<00:00, 6.74it/s]
PSNR = 32.993330
LPIPS (alex) = 0.055514
++> Evaluate epoch 10 Finished.
==> Start Training Epoch 11, lr=0.000396 ...
loss=0.0006 (0.0008), lr=0.000387: 100% 2030/2030 [03:41<00:00, 9.16it/s]
==> Finished Epoch 11.
==> Start Training Epoch 12, lr=0.000387 ...
loss=0.0002 (0.0007), lr=0.000380: 76% 1534/2030 [02:48<00:58, 8.41it/s][ WARN:0@2748.867] global loadsave.cpp:244 findDecoder imread_('data/ian/torso_imgs/95.png'): can't open/read file: check file path/integrity
Traceback (most recent call last):
File "/content/drive/MyDrive/RAD-NeRF/main.py", line 235, in <module>
File "/content/drive/MyDrive/RAD-NeRF/nerf/utils.py", line 906, in train
File "/content/drive/MyDrive/RAD-NeRF/nerf/utils.py", line 1156, in train_one_epoch
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 633, in __next__
data = self._next_data()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 677, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/content/drive/MyDrive/RAD-NeRF/nerf/provider.py", line 670, in collate
cv2.error: OpenCV(4.7.0) /io/opencv/modules/imgproc/src/color.cpp:182: error: (-215:Assertion failed) !_src.empty() in function 'cvtColor'

Exception ignored in atexit callback: <function FileWriter.__init__.<locals>.cleanup at 0x7fb6a66939a0>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/tensorboardX/writer.py", line 108, in cleanup
self.event_writer.close()
File "/usr/local/lib/python3.10/dist-packages/tensorboardX/event_file_writer.py", line 156, in close
self.flush()
File "/usr/local/lib/python3.10/dist-packages/tensorboardX/event_file_writer.py", line 148, in flush
self._ev_writer.flush()
File "/usr/local/lib/python3.10/dist-packages/tensorboardX/event_file_writer.py", line 69, in flush
self._py_recordio_writer.flush()
File "/usr/local/lib/python3.10/dist-packages/tensorboardX/record_writer.py", line 193, in flush
self._writer.flush()
OSError: [Errno 107] Transport endpoint is not connected
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
loss=0.0002 (0.0007), lr=0.000380: 76% 1534/2030 [02:48<00:54, 9.10it/s]
Exception ignored in: <function Trainer.__del__ at 0x7fb6cf2e5900>
Traceback (most recent call last):
File "/content/drive/MyDrive/RAD-NeRF/nerf/utils.py", line 704, in __del__
OSError: [Errno 107] Transport endpoint is not connected

Closing: it was just the Colab runtime running out, which dropped the Drive mount mid-epoch.
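The giveaway in the log is `OSError: [Errno 107] Transport endpoint is not connected`, which is the classic signature of a Google Drive mount going stale after the Colab session drops, even while the notebook UI still looks connected. A minimal probe sketch (the mount point is Colab's default; `mount_is_alive` is a hypothetical name, not part of RAD-NeRF):

```python
import os

def mount_is_alive(mount_point="/content/drive"):
    """Return True only if the mount point can actually be listed.

    A disconnected Drive mount raises OSError (Errno 107,
    'Transport endpoint is not connected') on any access, so a
    plain directory listing doubles as a health check.
    """
    try:
        os.listdir(mount_point)
        return True
    except OSError:
        return False
```

If this returns False on Colab, remounting with `drive.mount('/content/drive', force_remount=True)` usually recovers the session; if the runtime itself has expired, only a restart helps.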