Colab: torch CUDA out of memory
abrar-khan-368 opened this issue · 0 comments
abrar-khan-368 commented
Hi, I was trying to run the demo.sh file in Colab, but I'm getting a CUDA out-of-memory error at runtime (I used the pre-trained weights given in the links). Is there any plan for an upcoming update that reduces the model size? I tried reducing batch_size in the config file, but that didn't help.
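For reference, the OOM message at the bottom of the log suggests setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. This is a minimal sketch of how I could set that in the Colab notebook before launching demo.sh; the 128 MiB value is an arbitrary guess on my part, and I haven't confirmed it helps here:

```python
# Set the allocator option the OOM message points to, before anything
# touches CUDA. The 128 MiB split size is an arbitrary guess, not a
# documented recommendation; os.environ changes made in the notebook
# should be inherited by a demo.sh launched from the same kernel.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```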
Error:
/usr/local/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
Namespace(cfg='./config/demo/demo_224_real.yaml', local_rank=-1, cpt='/content/sfvznslazrwrrof8fv7uy23myc0oizhx.tar', permute=False)
{'dataset': {'aug_brightness': 0,
'aug_contrast': 0,
'aug_hue': 0,
'aug_saturation': 0,
'augmentation': False,
'category': 'general',
'frame_interval': 5,
'img_size': 224,
'img_size_height': 0,
'img_size_render': 112,
'mask_images': True,
'name': 'demo',
'num_frame': 5,
'task': 'singlesequence',
'train_all_frame': False,
'train_shuffle': False},
'eval_vis_freq': 20,
'exp_group': 'leap-demo',
'exp_name': 'demo_real',
'log_dir': './log',
'loss': {'iter_perceptual': 10000,
'weight_feat_render': 1.0,
'weight_perceptual': 0.1,
'weight_render_mask': 5.0,
'weight_render_rgb': 1.0},
'model': {'backbone_fix': False,
'backbone_name': 'dinov2',
'backbone_out_dim': 768,
'backbone_type': 'vitb14',
'encoder_layers': 2,
'latent_res': 16,
'lifting_TXdecoder_permute': False,
'lifting_layers': 4,
'lifting_use_conv3d': False,
'neck_layers': 2,
'neck_scale': 'constant_1',
'norm_first': False,
'pe_with_spatial_pe': False,
'render_feat_dim': 16,
'render_feat_raw': False,
'rot_representation': 'quat',
'use_flash_attn': False,
'use_neck': False,
'use_pe_lifting': False,
'volume_res': 64},
'output_dir': './output/',
'print_freq': 100,
'render': {'camera_focal': 250,
'camera_z': 4.0,
'k_size': 5,
'max_depth': 5.0,
'min_depth': 3.0,
'n_pts_per_ray': 64,
'volume_size': 2.0},
'seed': 42,
'test': {'batch_size': 1, 'compute_metric': True},
'train': {'accumulation_step': 1,
'batch_size': 1,
'grad_max': 5.0,
'lr': 0.0002,
'lr_backbone': 1e-05,
'lr_embeddings': 0.0001,
'min_rand_view': 3,
'normalize_img': True,
'pretrain_path': '',
'resume': True,
'schedular_warmup_iter': 500,
'total_iteration': 200000,
'use_amp': False,
'use_rand_view': False,
'use_uncanonicalized_pose': False,
'weight_decay': 0.005},
'vis_freq': 500,
'workers': 4}
checkpoint path: /content/sfvznslazrwrrof8fv7uy23myc0oizhx.tar
[57631] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '8888', 'RANK': '0', 'WORLD_SIZE': '1'}
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
[57631]: world_size = 1, rank = 0, backend=nccl
Using cache found in /root/.cache/torch/hub/facebookresearch_dinov2_main
/root/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/layers/swiglu_ffn.py:43: UserWarning: xFormers is available (SwiGLU)
warnings.warn("xFormers is available (SwiGLU)")
/root/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/layers/attention.py:27: UserWarning: xFormers is available (Attention)
warnings.warn("xFormers is available (Attention)")
/root/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/layers/block.py:33: UserWarning: xFormers is available (Block)
warnings.warn("xFormers is available (Block)")
using MLP layer as FFN
using 1 cuda
loading demo images
9 scenes found ['0', '1', '2', '3', '4', '5', '6', '7', '8']
all images shape torch.Size([9, 3, 3, 224, 224])
scene idx 0, permutation [0, 1, 2]
/usr/local/lib/python3.10/site-packages/torch/nn/functional.py:3737: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
warnings.warn("nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.")
scene idx 1, permutation [0, 1, 2]
Traceback (most recent call last):
File "/content/LEAP/demo.py", line 112, in <module>
main()
File "/content/LEAP/demo.py", line 100, in main
perform_inference(config, args.permute,
File "/content/LEAP/scripts/demo.py", line 40, in perform_inference
nvs_results = get_nvs_results(config, model, neural_volume, device)
File "/content/LEAP/scripts/demo.py", line 95, in get_nvs_results
render_results = model.module.render_module.render(cameras, cur_features, cur_densities)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/LEAP/model/volume_render.py", line 97, in forward
rendered = self.renderer(cameras=cameras, volumes=volume, render_depth=render_depth)[0] # [B,H,W,C+1]
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pytorch3d/renderer/implicit/renderer.py", line 253, in forward
return self.renderer(
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pytorch3d/renderer/implicit/renderer.py", line 172, in forward
rays_densities, rays_features = volumetric_function(
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pytorch3d/renderer/implicit/renderer.py", line 390, in forward
rays_features = torch.nn.functional.grid_sample(
File "/usr/local/lib/python3.10/site-packages/torch/nn/functional.py", line 4244, in grid_sample
return torch.grid_sampler(input, grid, mode_enum, padding_mode_enum, align_corners)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 344.00 MiB (GPU 0; 14.75 GiB total capacity; 13.81 GiB already allocated; 295.06 MiB free; 14.16 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 57631) of binary: /usr/local/bin/python
Traceback (most recent call last):
File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
demo.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-02-05_12:25:47
host : af8524bf09d5
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 57631)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
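For completeness: since batch_size is already 1, and the allocation fails inside the volume renderer's grid_sample call, I'd guess the render/volume settings in the config dump above are what drive the memory use. This sketch lists the values I would try lowering, written as the corresponding entries from the dict printed above; these are my own guesses based on the dump, not documented fixes:

```python
# Hypothetical reductions to config/demo/demo_224_real.yaml, expressed as
# the corresponding entries from the config dict printed in the log above.
# Whether these preserve output quality is untested.
config_overrides = {
    "model": {"volume_res": 32},       # neural volume resolution, was 64
    "render": {"n_pts_per_ray": 32},   # samples per rendering ray, was 64
    "train": {"use_amp": True},        # mixed precision, was False; this
                                       # may only apply to training, not
                                       # to the demo's inference path
}
```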