jby1993/SelfReconCode

CUDA out of memory


Hello, I've been training for a while, but an error is reported partway through. Is there any way to solve this problem without changing the graphics card?

scene data use female smpl
/home/xds/anaconda3/envs/SelfRecon/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1640811806235/work/aten/src/ATen/native/TensorShape.cpp:2157.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
camera ang threshold is 0.010285
box:
[-0.7080196142196655, -1.2795634269714355, -0.3215314447879791]
[0.7120546102523804, 0.7051210403442383, 0.3668109178543091]
/home/xds/project/SelfReconCode/MCAcc/seg3d_lossless.py:246: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
stride = (self.resolutions[-1] - 1) // (resolution - 1)
/home/xds/project/SelfReconCode/MCAcc/seg3d_lossless.py:261: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
coords_accum = coords // stride
/home/xds/project/SelfReconCode/MCAcc/seg3d_lossless.py:341: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
voxels = coords // stride
/home/xds/project/SelfReconCode/MCAcc/seg3d_lossless.py:381: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
point_coords = coords // stride
/home/xds/project/SelfReconCode/MCAcc/seg3d_lossless.py:417: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
voxels = coords // stride
Traceback (most recent call last):
File "train.py", line 167, in <module>
loss=optNet(outs,sample_pix_num,ratio,frame_ids,debug_root)
File "/home/xds/anaconda3/envs/SelfRecon/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xds/project/SelfReconCode/model/network.py", line 502, in forward
total_loss=self.computeTmpPcLoss(defMeshes,[d_cond,[poses,trans]],masks,mgtMs,ratio)
File "/home/xds/project/SelfReconCode/model/network.py", line 687, in computeTmpPcLoss
loss.backward()
File "/home/xds/anaconda3/envs/SelfRecon/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/xds/anaconda3/envs/SelfRecon/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
File "/home/xds/anaconda3/envs/SelfRecon/lib/python3.8/site-packages/torch/autograd/function.py", line 199, in apply
return user_fn(self, *args)
File "/home/xds/anaconda3/envs/SelfRecon/lib/python3.8/site-packages/pytorch3d-0.4.0-py3.8-linux-x86_64.egg/pytorch3d/renderer/compositing.py", line 56, in backward
grad_features, grad_alphas = _C.accum_alphacomposite_backward(
RuntimeError: CUDA out of memory. Tried to allocate 668.00 MiB (GPU 0; 10.76 GiB total capacity; 8.00 GiB already allocated; 443.38 MiB free; 8.18 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
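
For anyone reading the message above: the hint about max_split_size_mb only applies when reserved memory is much larger than allocated memory. Below is a minimal sketch (the helper name parse_oom is made up for illustration, and the message format is assumed to match the one in this log) that pulls the sizes out of the message. Here the cached-but-unused slack comes to roughly 184 MiB, and even together with the 443 MiB reported free that is less than the 668 MiB requested, so this looks like a genuine shortage rather than fragmentation.

```python
import re

# Conversion factors to MiB for the units PyTorch prints in OOM messages.
_UNITS = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}

def parse_oom(message: str) -> dict:
    """Extract the sizes (in MiB) reported in a PyTorch CUDA OOM message.

    Hypothetical helper: the patterns assume the wording seen in this thread.
    """
    size = r"([\d.]+) (KiB|MiB|GiB)"
    patterns = {
        "requested": r"Tried to allocate " + size,
        "total": size + r" total capacity",
        "allocated": size + r" already allocated",
        "free": size + r" free",
        "reserved": size + r" reserved",
    }
    out = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            value, unit = match.groups()
            out[key] = float(value) * _UNITS[unit]
    return out

msg = (
    "CUDA out of memory. Tried to allocate 668.00 MiB (GPU 0; 10.76 GiB total capacity; "
    "8.00 GiB already allocated; 443.38 MiB free; 8.18 GiB reserved in total by PyTorch)"
)
info = parse_oom(msg)
slack = info["reserved"] - info["allocated"]  # memory cached by PyTorch but not in use
print(f"requested {info['requested']:.0f} MiB, free {info['free']:.0f} MiB, slack {slack:.0f} MiB")
```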

The default config needs a lot of memory, and an RTX 3090 is recommended. You can lower the marching cubes resolution to reduce memory use, but the related optimization parameters then need to be readjusted as well, which is a little tedious.

Thank you! I will try adjusting the parameters and hope it works.

Do you know how much memory is needed?

Almost 24 GB.

I'm using a GeForce RTX 3070 Laptop GPU and got a similar error, shown below.
I edited config.conf a bit, reducing "sample_pix_num", "num_workers", and "batch_size", but none of it helped.
Which parameters should I edit to avoid the CUDA out of memory error?

Error message:

$ CUDA_VISIBLE_DEVICES=0 python train.py --gpu-ids 0 --conf config.conf --data $ROOT/female-3-casual --save-folder result
scene data use female smpl
/home/mas/anaconda3/envs/SelfRecon/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1640811806235/work/aten/src/ATen/native/TensorShape.cpp:2157.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Traceback (most recent call last):
File "train.py", line 98, in <module>
optNet,sdf_initialized=getOptNet(dataset,batch_size,bmins,bmaxs,resolutions['coarse'],device,config,use_initial_sdf)
File "/home/mas/proj/study/computer_vision/SelfReconCode/model/network.py", line 850, in getOptNet
skinner,tmpBodyVs,tmpBodyFs=initialLBSkinner(dataset.gender,dataset.shape.to(device),initPose,(128+1, 224+1, 64+1),bmins,bmaxs)
File "/home/mas/proj/study/computer_vision/SelfReconCode/model/Deformer.py", line 294, in initialLBSkinner
ws=compute_lbswField(bmins,bmaxs,resolution,verts.view(6890,3),smpl.weight.view(6890,24),align_corners=False,mean_neighbor=30,smooth_times=30)
File "/home/mas/proj/study/computer_vision/SelfReconCode/model/Deformer.py", line 269, in compute_lbswField
dists,indices=(tmp[:,None,:]-smpl_verts[None,:,:]).norm(dim=-1).topk(mean_neighbor,dim=-1,largest=False)
File "/home/mas/anaconda3/envs/SelfRecon/lib/python3.8/site-packages/torch/_tensor.py", line 442, in norm
return torch.norm(self, p, dim, keepdim, dtype=dtype)
File "/home/mas/anaconda3/envs/SelfRecon/lib/python3.8/site-packages/torch/functional.py", line 1442, in norm
return _VF.frobenius_norm(input, _dim, keepdim=keepdim)
RuntimeError: CUDA out of memory. Tried to allocate 1.29 GiB (GPU 0; 7.80 GiB total capacity; 5.22 GiB already allocated; 724.12 MiB free; 5.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
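
A note on why this line is so hungry: the expression (tmp[:,None,:]-smpl_verts[None,:,:]) materializes a tensor of shape [N, 6890, 3] before the norm is taken. The back-of-the-envelope sketch below assumes float32 tensors; the helper names are made up for illustration, and whether tmp is the whole grid or already a slice of it is not visible from the traceback.

```python
# Rough size estimate for the broadcast in compute_lbswField:
#   (tmp[:, None, :] - smpl_verts[None, :, :])  ->  shape [N, 6890, 3]
# Assumption: float32 tensors (4 bytes per element).

SMPL_VERTS = 6890       # from verts.view(6890, 3) in the traceback
BYTES_PER_FLOAT32 = 4   # assumed dtype

def pairwise_diff_bytes(num_points: int, num_verts: int = SMPL_VERTS, dims: int = 3) -> int:
    """Bytes needed to hold the [num_points, num_verts, dims] difference tensor."""
    return num_points * num_verts * dims * BYTES_PER_FLOAT32

def max_chunk_points(budget_bytes: int, num_verts: int = SMPL_VERTS, dims: int = 3) -> int:
    """Largest number of query points whose difference tensor fits in budget_bytes."""
    return budget_bytes // (num_verts * dims * BYTES_PER_FLOAT32)

# Resolution passed to initialLBSkinner in the traceback.
grid_points = (128 + 1) * (224 + 1) * (64 + 1)
full_gib = pairwise_diff_bytes(grid_points) / 2**30
print(f"{grid_points} grid points -> {full_gib:.1f} GiB if computed in one shot")
print(f"query points per 1 GiB budget: {max_chunk_points(2**30)}")
```

If that reading is right, feeding the query points through in smaller slices (and keeping only the top-k result per slice) should cap the peak allocation, at the cost of more iterations. That would mean editing Deformer.py, so treat it as an idea to try rather than a confirmed fix.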

zhihu
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
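
A note on placement, as a sketch rather than a confirmed fix: the allocator setting has to be in the environment before PyTorch makes its first CUDA allocation, so the safest spot is the very top of train.py, ahead of `import torch`. Keep in mind that max_split_size_mb only targets fragmentation. In the error above, reserved (5.23 GiB) and allocated (5.22 GiB) are nearly equal, so it may not help in this case.

```python
import os

# Set the allocator option before torch is imported so that it is already in
# the environment when the CUDA caching allocator is first initialized.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# import torch  # import torch (and anything that imports it) only after this point

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```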

Thank you, will try.

I put

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

at line 27 in train.py, and ran

CUDA_VISIBLE_DEVICES=0 python train.py --gpu-ids 0 --conf config.conf --data $ROOT/female-3-casual --save-folder result

But it failed with "Segmentation fault (core dumped)" ...