guohaoxiang/ComplexGen

Trying to allocate about 5000 GiB

Closed this issue · 1 comment

oucyz commented

Hello!
I'm interested in your great work and I'm trying to run your code.
However, I'm having trouble with the following error, which says I'm trying to allocate about 5000 GiB.
That number seems far too large.
Do you have any idea what might be causing it, e.g. something related to the data or model size?
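
As a rough sanity check on that number (assuming float32 tensors, i.e. 4 bytes per element), the allocation reported by the error below corresponds to more than 10^12 values, which suggests a tensor shape is blowing up somewhere rather than the model or data themselves being that large:

    # Back-of-the-envelope check of the allocation reported by the OOM error below,
    # assuming float32 elements (4 bytes each).
    gib_requested = 4966.70            # from the CUDA OOM message
    n_bytes = gib_requested * 2**30    # ~5.33e12 bytes
    n_floats = n_bytes / 4             # ~1.33e12 float32 elements
    print(f"{n_bytes:.3g} bytes ~= {n_floats:.3g} float32 values")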

Environment:

  • Docker image: pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel
  • GPU: GeForce RTX 3090, 24GB
root@oucyz:/workspace# scripts/train_small.sh
not detected /blob directory, execute locally
Utilize 1 gpus
/root/.local/lib/python3.8/site-packages/MinkowskiEngine/__init__.py:36: UserWarning: The environment variable `OMP_NUM_THREADS` not set. MinkowskiEngine will automatically set `OMP_NUM_THREADS=16`. If you want to set `OMP_NUM_THREADS` manually, please export it on the command line before running a python script. e.g. `export OMP_NUM_THREADS=12; python your_program.py`. It is recommended to set it below 24.
  warnings.warn(
using instance norm
2022-11-07 09:07:10.839856: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-07 09:07:11.057767: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-11-07 09:07:11.771248: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-11-07 09:07:11.771496: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-11-07 09:07:11.771546: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
not detected /blob directory, execute locally
Utilize 1 gpus
using instance norm
2022-11-07 09:07:13.650011: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-07 09:07:13.897804: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-11-07 09:07:14.678678: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-11-07 09:07:14.678788: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-11-07 09:07:14.678800: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
load data from data/train_small
packed pkl folder detected, will load from packed pkl file
Successfully Loaded from 19 files:19
max number of corners in single sample: 32
2 curves at least
448 valid curves total
250 valid corners total
225 patches total
min and max points in single patch: 512 512
0 open shapes
squared curve length statistics: 448 3.682233176771585e-05 9.269843007646973 0.3276165163696903
patch area statistics: 225 0.000722008498996729 1.3858339398102544 0.1676870428241
normal is included in input signal
load data from data/train_small
packed pkl folder detected, will load from packed pkl file
Successfully Loaded from 19 files:19
max number of corners in single sample: 32
2 curves at least
448 valid curves total
250 valid corners total
225 patches total
min and max points in single patch: 512 512
0 open shapes
squared curve length statistics: 448 3.682233176771585e-05 9.269843007646973 0.3276165163696903
patch area statistics: 225 0.000722008498996729 1.3858339398102544 0.1676870428241
normal is included in input signal
number of params: 22057152 87052323
Try to restore from checkpoint
  0%|                                                                                                                                                                                                                   | 0/5 [00:00<?, ?it/s]Start Training
train data size 19
/workspace/data_loader_abc.py:248: NumbaWarning:
Compilation is falling back to object mode WITH looplifting enabled because Function "points2sparse_voxel" failed type inference due to: No implementation of function Function(<function norm at 0x7fb7228699d0>) found for signature:

 >>> norm(array(float32, 2d, A), axis=Literal[int](1), keepdims=Literal[bool](True))

There are 2 candidate implementations:
  - Of which 2 did not match due to:
  Overload in function 'norm_impl': File: numba/np/linalg.py: Line 2351.
    With argument(s): '(array(float32, 2d, A), axis=int64, keepdims=bool)':
   Rejected as the implementation raised a specific error:
     TypingError: got an unexpected keyword argument 'axis'
  raised from /root/.local/lib/python3.8/site-packages/numba/core/typing/templates.py:791

During: resolving callee type: Function(<function norm at 0x7fb7228699d0>)
During: typing of call at /workspace/data_loader_abc.py (255)


File "data_loader_abc.py", line 255:
def points2sparse_voxel(points_with_normal, voxel_dim, feature_type, with_normal, pad1s):
    <source elided>
    voxel_coord = np.clip(np.floor(points / voxel_length).astype(np.int32), 0, voxel_dim-1)
    points_normal_norm = linalg.norm(points_with_normal[:,3:], axis=1, keepdims=True)
    ^

  @numba.jit()
/root/.local/lib/python3.8/site-packages/numba/core/object_mode_passes.py:151: NumbaWarning: Function "points2sparse_voxel" was compiled in object mode without forceobj=True.

File "data_loader_abc.py", line 249:
@numba.jit()
def points2sparse_voxel(points_with_normal, voxel_dim, feature_type, with_normal, pad1s):
^

  warnings.warn(errors.NumbaWarning(warn_msg,
/root/.local/lib/python3.8/site-packages/numba/core/object_mode_passes.py:161: NumbaDeprecationWarning:
Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.

For more information visit https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit

File "data_loader_abc.py", line 249:
@numba.jit()
def points2sparse_voxel(points_with_normal, voxel_dim, feature_type, with_normal, pad1s):
^

  warnings.warn(errors.NumbaDeprecationWarning(msg,
  0%|                                                                                                                                                                                                                   | 0/5 [00:05<?, ?it/s]
Traceback (most recent call last):
  File "Minkowski_backbone.py", line 4582, in <module>
    mp.spawn(pipeline_abc,
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/workspace/Minkowski_backbone.py", line 4370, in pipeline_abc
    patch_loss_dict, patch_matching_indices = patch_loss_criterion(patch_predictions, target_patches_list)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/Minkowski_backbone.py", line 2532, in forward
    losses.update(self.get_loss(loss, outputs, targets, indices, num_corners))
  File "/workspace/Minkowski_backbone.py", line 2495, in get_loss
    return loss_map[loss](outputs, targets, indices, num_patches, **kwargs)
  File "/workspace/Minkowski_backbone.py", line 2220, in loss_geometry
    loss_geom[uclose_id] = emd_by_id(target_patch_points_batch[uclose_id], src_patch_points[uclose_id], self.emd_idlist_u, points_per_patch_dim)
RuntimeError: CUDA out of memory. Tried to allocate 4966.70 GiB (GPU 0; 23.69 GiB total capacity; 9.58 GiB already allocated; 12.36 GiB free; 9.62 GiB reserved in total by PyTorch)
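
(Side note on the Numba warning in the log above: it seems to occur because Numba's nopython implementation of np.linalg.norm does not accept the axis/keepdims keywords, so points2sparse_voxel falls back to object mode. It is probably unrelated to the OOM, but a nopython-friendly rewrite of that one call could look roughly like the sketch below; row_l2_norm is a hypothetical helper, and float32 is assumed because that is the dtype shown in the warning.)

    import numpy as np
    import numba

    @numba.njit
    def row_l2_norm(v):
        # Replacement for linalg.norm(v, axis=1, keepdims=True), which Numba
        # rejects in nopython mode; computes the per-row L2 norm explicitly.
        out = np.empty((v.shape[0], 1), dtype=np.float32)
        for i in range(v.shape[0]):
            s = 0.0
            for j in range(v.shape[1]):
                s += v[i, j] * v[i, j]
            out[i, 0] = np.sqrt(s)
        return out

    # Inside points2sparse_voxel the call would then become:
    # points_normal_norm = row_l2_norm(points_with_normal[:, 3:])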


Hi, oucyz. I haven't encountered this issue before. Have you tried forwarding the checkpoint directly by running ./scripts/test_default.sh?
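
If the OOM still shows up, one quick way to narrow it down might be to print the shapes of the tensors going into the EMD term right before the failing call in loss_geometry (Minkowski_backbone.py, around the line shown in the traceback). The helper below is only a hypothetical debug probe; the variable names are taken from the traceback above:

    def debug_emd_inputs(target_patch_points_batch, src_patch_points, uclose_id):
        # Call this just before emd_by_id(...) in loss_geometry to see which
        # dimension explodes; names follow the traceback above.
        tgt = target_patch_points_batch[uclose_id]
        src = src_patch_points[uclose_id]
        print("target patch points:", tuple(tgt.shape), tgt.dtype, tgt.device)
        print("source patch points:", tuple(src.shape), src.dtype, src.device)
        print("uclose_id:", uclose_id)

If either shape is wildly larger than expected (e.g. far beyond the 512 points per patch reported in the log), that would point to the indexing or batching of the patches rather than to the GPU itself.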