CUDA out of memory
Michaelwhite34 opened this issue · 6 comments
train_scene.sh drv/rabbit
Hello Wooden
Load data: Begin
Not using masks
image shape, mask shape: torch.Size([324, 768, 1024, 3]) torch.Size([324, 768, 1024, 3])
image pixel range: 0.0 1.0
Load data: End
0%| | 0/100001 [00:00<?, ?it/s]
Traceback (most recent call last):
File "render_volume.py", line 449, in
runner.train()
File "render_volume.py", line 127, in train
render_out = self.renderer.render(
File "/home/michael/iron/models/renderer.py", line 374, in render
ret_fine = self.render_core(
File "/home/michael/iron/models/renderer.py", line 233, in render_core
gradients = sdf_network.gradient(pts)
File "/home/michael/iron/models/fields.py", line 110, in gradient
gradients = torch.autograd.grad(
File "/home/michael/anaconda3/envs/iron/lib/python3.8/site-packages/torch/autograd/init.py", line 275, in grad
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 5.80 GiB total capacity; 4.03 GiB already allocated; 118.56 MiB free; 4.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
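The allocator hint in the error above (max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF) only helps when reserved memory is much larger than allocated memory, but it costs nothing to try. A minimal sketch, with 128 MiB as an arbitrary example value:

```python
# Sketch: apply the fragmentation workaround named in the OOM message.
# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator starts up,
# so set it before importing torch (or export it in the shell before
# launching train_scene.sh).  128 is an arbitrary example value in MiB.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch
x = torch.zeros(1, device="cuda")  # the first allocation picks up the setting
```

Note that this only reduces fragmentation; it cannot create extra memory on a ~6 GiB card.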
Wrote config file to ./exp_iron_stage2/drv/rabbit/args.txt
render_surface.py:256: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use import imageio.v2 as imageio or call imageio.v2.imread directly.
im = imageio.imread(fpath).astype(np.float32) / 255.0
ic| fill_holes: False
handle_edges: True
is_training: True
args.inv_gamma_gt: False
0%| | 0/50001 [00:00<?, ?it/s]ic| args.out_dir: './exp_iron_stage2/drv/rabbit'
global_step: 0
loss.item(): 0.00573146715760231
img_loss.item(): 0.0
img_l2_loss.item(): 0.0
img_ssim_loss.item(): 0.0
eik_loss.item(): 0.00573146715760231
roughrange_loss.item(): 0.0
color_network_dict["point_light_network"].get_light().item(): 5.6220927238464355
1%|▎ | 499/50001 [01:35<3:20:37, 4.11it/s]ic| args.out_dir: './exp_iron_stage2/drv/rabbit'
global_step: 500
loss.item(): 0.014144735410809517
img_loss.item(): 0.0
img_l2_loss.item(): 0.0
img_ssim_loss.item(): 0.0
eik_loss.item(): 0.014144735410809517
roughrange_loss.item(): 0.0
color_network_dict["point_light_network"].get_light().item(): 5.224419593811035
Another out-of-memory error after I stop the process and relaunch:
^Z
[1]+ Stopped python render_surface.py --data_dir ./data_flashlight/${SCENE}/train --out_dir ./exp_iron_stage2/${SCENE} --neus_ckpt_fpath ./exp_iron_stage1/${SCENE}/checkpoints/ckpt_100000.pth --num_iters 50001 --gamma_pred
ic| args: Namespace(data_dir='./data_flashlight/drv/rabbit/test', eik_weight=0.1, export_all=False, gamma_pred=True, init_light_scale=8.0, inv_gamma_gt=False, is_metal=False, neus_ckpt_fpath='./exp_iron_stage1/drv/rabbit/checkpoints/ckpt_100000.pth', no_edgesample=False, num_iters=50001, out_dir='./exp_iron_stage2/drv/rabbit', patch_size=128, plot_image_name=None, render_all=True, roughrange_weight=0.1, ssim_weight=1.0)
Wrote config file to ./exp_iron_stage2/drv/rabbit/args.txt
Traceback (most recent call last):
File "render_surface.py", line 136, in
sdf_network = SDFNetwork(
File "/home/michael/anaconda3/envs/iron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/michael/anaconda3/envs/iron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
module._apply(fn)
File "/home/michael/anaconda3/envs/iron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
param_applied = fn(param)
File "/home/michael/anaconda3/envs/iron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
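For the second failure: ^Z only suspends the stage-2 job, so the stopped python process keeps all of the GPU memory it had allocated, and the next launch then fails while merely moving SDFNetwork to the GPU. Killing the stopped job (e.g. bringing it to the foreground and interrupting it) frees that memory; a quick way to confirm the card is actually free before relaunching:

```python
# Quick check that no suspended job is still holding the GPU: if "free" is
# tiny while nothing is visibly running, a stopped (Ctrl-Z) process is the
# likely culprit.  torch.cuda.mem_get_info is available in recent PyTorch.
import torch

free, total = torch.cuda.mem_get_info(0)  # returns (free_bytes, total_bytes)
print(f"free: {free / 1024**3:.2f} GiB / total: {total / 1024**3:.2f} GiB")
```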
Note at least 12GB GPU memory is needed for the default settings. You can try decreasing the rendered patch size if you have less memory.
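For stage 2 (render_surface.py), the most direct lever appears to be the patch_size flag visible in the args Namespace above (128 in these logs), since the number of rays traced per training patch grows with the square of the patch side. A rough illustration of the scaling; 96 and 64 are assumed smaller values, not official recommendations:

```python
# Rays rendered per stage-2 training patch scale quadratically with
# --patch_size (128 is the value shown in the args Namespace in this thread).
for patch_size in (128, 96, 64):
    print(f"patch_size={patch_size:3d} -> {patch_size ** 2:5d} rays per iteration")
# patch_size=128 -> 16384 rays per iteration
# patch_size= 96 ->  9216 rays per iteration
# patch_size= 64 ->  4096 rays per iteration
```

For stage 1, the batch_size and n_samples values in the womask config (which the reporter already lowered) are the corresponding levers.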
I decreased batch_size and n_samples in womask_iron, and now it only errors in the final mesh and UV export stage.
Can you tell me exactly which parameters and files I should modify?
100%|███████████████████████████████████| 50001/50001 [5:30:39<00:00, 2.52it/s]
ic| f"Exporting mesh and materials to: {export_out_dir}": ('Exporting mesh and materials to: '
'./exp_iron_stage2/drv/rabbit/mesh_and_materials_50000')
ic| 'Exporting mesh and uv...'
face_normals incorrect shape, ignoring!
/home/michael/iron/models/export_mesh.py:82: UserWarning: torch.eig is deprecated in favor of torch.linalg.eig and will be removed in a future PyTorch release.
torch.linalg.eig returns complex tensors of dtype cfloat or cdouble rather than real tensors mimicking complex tensors.
L, _ = torch.eig(A)
should be replaced with
L_complex = torch.linalg.eigvals(A)
and
L, V = torch.eig(A, eigenvectors=True)
should be replaced with
L_complex, V_complex = torch.linalg.eig(A) (Triggered internally at ../aten/src/ATen/native/BatchLinearAlgebra.cpp:2910.)
vecs = torch.eig(s_cov, True)[1].transpose(0, 1)
Traceback (most recent call last):
File "render_surface.py", line 549, in
export_mesh_and_materials(export_out_dir, sdf_network, color_network_dict)
File "render_surface.py", line 325, in export_mesh_and_materials
export_mesh(sdf_fn, os.path.join(export_out_dir, "mesh.obj"))
File "/home/michael/iron/models/export_mesh.py", line 87, in export_mesh
grid_aligned = get_grid(helper.cpu(), resolution)
File "/home/michael/iron/models/export_mesh.py", line 41, in get_grid
grid_points = torch.tensor(np.vstack([xx.ravel(), yy.ravel(), zz.ravel()]).T, dtype=torch.float).cuda()
RuntimeError: CUDA out of memory. Tried to allocate 4.52 GiB (GPU 0; 5.80 GiB total capacity; 68.10 MiB already allocated; 4.27 GiB free; 104.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ic| args: Namespace(data_dir='./data_flashlight/drv/rabbit/test', eik_weight=0.1, export_all=False, gamma_pred=True, init_light_scale=8.0, inv_gamma_gt=False, is_metal=False, neus_ckpt_fpath='./exp_iron_stage1/drv/rabbit/checkpoints/ckpt_100000.pth', no_edgesample=False, num_iters=50001, out_dir='./exp_iron_stage2/drv/rabbit', patch_size=128, plot_image_name=None, render_all=True, roughrange_weight=0.1, ssim_weight=1.0)
Wrote config file to ./exp_iron_stage2/drv/rabbit/args.txt
render_surface.py:256: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use import imageio.v2 as imageio or call imageio.v2.imread directly.
im = imageio.imread(fpath).astype(np.float32) / 255.0
ic| len(image_fpaths): 82
gt_images.shape: torch.Size([82, 768, 1024, 3])
Ks.shape: torch.Size([82, 4, 4])
W2Cs.shape: torch.Size([82, 4, 4])
len(cameras): 82
ic| args.neus_ckpt_fpath: './exp_iron_stage1/drv/rabbit/checkpoints/ckpt_100000.pth'
ic| f"Loading from neus checkpoint: {args.neus_ckpt_fpath}": ('Loading from neus checkpoint: '
'./exp_iron_stage1/drv/rabbit/checkpoints/ckpt_100000.pth')
ic| "Reloading from checkpoint: ": 'Reloading from checkpoint: '
ckpt_fpath: './exp_iron_stage2/drv/rabbit/ckpt_50000.pth'
ic| dist: 0.8803050220012665
color_network_dict["point_light_network"].light.data: tensor(1.7133, device='cuda:0')
ic| start_step: 50000
ic| f"Rendering images to: {render_out_dir}": 'Rendering images to: ./exp_iron_stage2/drv/rabbit/render_test_50000'
2%|█ | 2/82 [00:23<15:21, 11.52s/it]
Traceback (most recent call last):
File "render_surface.py", line 367, in
results = render_camera(
File "/home/michael/iron/models/raytracer.py", line 834, in render_camera
results = raytrace_camera(
File "/home/michael/anaconda3/envs/iron/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/michael/iron/models/raytracer.py", line 581, in raytrace_camera
results = raytrace_pixels(sdf_network, raytracer, camera.get_uv(), camera, max_num_rays=max_num_rays)
File "/home/michael/anaconda3/envs/iron/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/michael/iron/models/raytracer.py", line 392, in raytrace_pixels
results = raytracer(
File "/home/michael/anaconda3/envs/iron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/michael/anaconda3/envs/iron/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/michael/iron/models/raytracer.py", line 67, in forward
(sampler_convergent_mask, sampler_points, sampler_sdf, sampler_dis,) = self.ray_sampler(
File "/home/michael/iron/models/raytracer.py", line 154, in ray_sampler
sdf_val.append(sdf(pnts))
File "/home/michael/iron/models/raytracer.py", line 370, in
sdf = lambda x: sdf_network(x)[..., 0]
File "/home/michael/anaconda3/envs/iron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/michael/iron/models/fields.py", line 92, in forward
x = torch.cat([x, inputs], -1) / np.sqrt(2)
RuntimeError: CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 5.80 GiB total capacity; 3.94 GiB already allocated; 131.62 MiB free; 4.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
And another question: after preparing my own images, do I just need to run colmap_runner to get kai_cameras_normalized.json and rename it to cam_dict_norm.json?
@Kai-46 Yeah, I met the same problem when training the superman dataset.
I found that in models/export_mesh.py the line
grid_points = torch.tensor(np.vstack([xx.ravel(), yy.ravel(), zz.ravel()]).T, dtype=torch.float).cuda()
creates xx, yy, zz arrays that are huge.
How do I change the default settings?
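One way to read the failing line: get_grid appears to build the full resolution³ × 3 grid of query points and push it to the GPU as a single tensor (a 512³ grid is already about 1.5 GiB in float32; the 4.52 GiB allocation above corresponds to an even denser grid). Two hedged levers, not the repo's official fix: lower the resolution passed to get_grid in models/export_mesh.py, or keep the grid on the CPU and move it to the GPU only chunk by chunk while querying the SDF. A minimal sketch of the second option; res, the bounds, and sdf_fn are stand-ins for the corresponding names in export_mesh.py / render_surface.py:

```python
# Sketch only (not the actual export_mesh.py code): build the grid on the CPU
# and evaluate the SDF in chunks, so only a small slice of points lives on the
# GPU at any time.
import numpy as np
import torch

res = 256                              # assumed lower value than the failing default
axis = np.linspace(-1.0, 1.0, res)     # bounds chosen for illustration only
xx, yy, zz = np.meshgrid(axis, axis, axis, indexing="ij")
grid_points_cpu = torch.tensor(
    np.vstack([xx.ravel(), yy.ravel(), zz.ravel()]).T, dtype=torch.float
)                                      # note: no .cuda() on the full grid

def query_sdf_in_chunks(sdf_fn, points_cpu, chunk=262144):
    """Evaluate sdf_fn over an (N, 3) CPU tensor without moving it all to GPU."""
    out = []
    with torch.no_grad():
        for i in range(0, points_cpu.shape[0], chunk):
            pts = points_cpu[i : i + chunk].cuda()
            out.append(sdf_fn(pts).cpu())
    return torch.cat(out, dim=0)
```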