hwjiang1510/LEAP

Colab: torch CUDA out of memory

abrar-khan-368 opened this issue · 0 comments

Hi, I was trying to run demo.sh in Colab, but I'm getting an out-of-memory error at runtime (using the pre-trained weights provided in the links). Is there any planned update to reduce the model size? I tried reducing batch_size in the config file, but it didn't help; a sketch of the settings I was looking at is below.
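
For reference, these are the config entries that looked memory-related to me, going by the config printout in the log below (a rough sketch only; I'm not sure which of these the demo actually uses, and apart from batch_size I haven't tried them):

```yaml
# Sketch of memory-related keys, names taken from the printed config below.
train:
  batch_size: 1        # already 1 in my run
test:
  batch_size: 1        # already 1 in my run
dataset:
  num_frame: 5         # fewer input frames might shrink activations (untested)
  img_size_render: 112 # lower render resolution (untested)
render:
  n_pts_per_ray: 64    # fewer samples per ray during volume rendering (untested)
```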

Error:

/usr/local/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
Namespace(cfg='./config/demo/demo_224_real.yaml', local_rank=-1, cpt='/content/sfvznslazrwrrof8fv7uy23myc0oizhx.tar', permute=False)
{'dataset': {'aug_brightness': 0,
             'aug_contrast': 0,
             'aug_hue': 0,
             'aug_saturation': 0,
             'augmentation': False,
             'category': 'general',
             'frame_interval': 5,
             'img_size': 224,
             'img_size_height': 0,
             'img_size_render': 112,
             'mask_images': True,
             'name': 'demo',
             'num_frame': 5,
             'task': 'singlesequence',
             'train_all_frame': False,
             'train_shuffle': False},
 'eval_vis_freq': 20,
 'exp_group': 'leap-demo',
 'exp_name': 'demo_real',
 'log_dir': './log',
 'loss': {'iter_perceptual': 10000,
          'weight_feat_render': 1.0,
          'weight_perceptual': 0.1,
          'weight_render_mask': 5.0,
          'weight_render_rgb': 1.0},
 'model': {'backbone_fix': False,
           'backbone_name': 'dinov2',
           'backbone_out_dim': 768,
           'backbone_type': 'vitb14',
           'encoder_layers': 2,
           'latent_res': 16,
           'lifting_TXdecoder_permute': False,
           'lifting_layers': 4,
           'lifting_use_conv3d': False,
           'neck_layers': 2,
           'neck_scale': 'constant_1',
           'norm_first': False,
           'pe_with_spatial_pe': False,
           'render_feat_dim': 16,
           'render_feat_raw': False,
           'rot_representation': 'quat',
           'use_flash_attn': False,
           'use_neck': False,
           'use_pe_lifting': False,
           'volume_res': 64},
 'output_dir': './output/',
 'print_freq': 100,
 'render': {'camera_focal': 250,
            'camera_z': 4.0,
            'k_size': 5,
            'max_depth': 5.0,
            'min_depth': 3.0,
            'n_pts_per_ray': 64,
            'volume_size': 2.0},
 'seed': 42,
 'test': {'batch_size': 1, 'compute_metric': True},
 'train': {'accumulation_step': 1,
           'batch_size': 1,
           'grad_max': 5.0,
           'lr': 0.0002,
           'lr_backbone': 1e-05,
           'lr_embeddings': 0.0001,
           'min_rand_view': 3,
           'normalize_img': True,
           'pretrain_path': '',
           'resume': True,
           'schedular_warmup_iter': 500,
           'total_iteration': 200000,
           'use_amp': False,
           'use_rand_view': False,
           'use_uncanonicalized_pose': False,
           'weight_decay': 0.005},
 'vis_freq': 500,
 'workers': 4}
checkpoint path: /content/sfvznslazrwrrof8fv7uy23myc0oizhx.tar
[57631] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '8888', 'RANK': '0', 'WORLD_SIZE': '1'}
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
[57631]: world_size = 1, rank = 0, backend=nccl
Using cache found in /root/.cache/torch/hub/facebookresearch_dinov2_main
/root/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/layers/swiglu_ffn.py:43: UserWarning: xFormers is available (SwiGLU)
  warnings.warn("xFormers is available (SwiGLU)")
/root/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/layers/attention.py:27: UserWarning: xFormers is available (Attention)
  warnings.warn("xFormers is available (Attention)")
/root/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/layers/block.py:33: UserWarning: xFormers is available (Block)
  warnings.warn("xFormers is available (Block)")
using MLP layer as FFN
using 1 cuda
loading demo images
9 scenes found ['0', '1', '2', '3', '4', '5', '6', '7', '8']
all images shape torch.Size([9, 3, 3, 224, 224])
scene idx 0, permutation [0, 1, 2]
/usr/local/lib/python3.10/site-packages/torch/nn/functional.py:3737: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
  warnings.warn("nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.")
scene idx 1, permutation [0, 1, 2]
Traceback (most recent call last):
  File "/content/LEAP/demo.py", line 112, in <module>
    main()
  File "/content/LEAP/demo.py", line 100, in main
    perform_inference(config, args.permute,
  File "/content/LEAP/scripts/demo.py", line 40, in perform_inference
    nvs_results = get_nvs_results(config, model, neural_volume, device)
  File "/content/LEAP/scripts/demo.py", line 95, in get_nvs_results
    render_results = model.module.render_module.render(cameras, cur_features, cur_densities)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/LEAP/model/volume_render.py", line 97, in forward
    rendered = self.renderer(cameras=cameras, volumes=volume, render_depth=render_depth)[0]  # [B,H,W,C+1]
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pytorch3d/renderer/implicit/renderer.py", line 253, in forward
    return self.renderer(
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pytorch3d/renderer/implicit/renderer.py", line 172, in forward
    rays_densities, rays_features = volumetric_function(
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pytorch3d/renderer/implicit/renderer.py", line 390, in forward
    rays_features = torch.nn.functional.grid_sample(
  File "/usr/local/lib/python3.10/site-packages/torch/nn/functional.py", line 4244, in grid_sample
    return torch.grid_sampler(input, grid, mode_enum, padding_mode_enum, align_corners)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 344.00 MiB (GPU 0; 14.75 GiB total capacity; 13.81 GiB already allocated; 295.06 MiB free; 14.16 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 57631) of binary: /usr/local/bin/python
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
demo.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-05_12:25:47
  host      : af8524bf09d5
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 57631)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
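
The OOM message itself suggests setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF to reduce fragmentation. A minimal, untested sketch of that (the 128 MiB split size is just an arbitrary starting point, and it may not help much given that ~13.8 GiB is already allocated):

```bash
# Untested: set the allocator option the OOM message suggests, then rerun the demo.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
bash demo.sh
```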