Error when training on the ScanNet200 dataset
xiaotiancai899 opened this issue · 3 comments
While I was training on the ScanNet200 dataset, an error occurred at epoch 55 out of 120.
Traceback (most recent call last):
File "tools/train.py", line 332, in <module>
main()
File "tools/train.py", line 323, in main
train(epoch, model, optimizer, scheduler, scaler, train_loader, cfg, logger, writer)
File "tools/train.py", line 80, in train
loss, log_vars = model(batch, return_loss=True, epoch=epoch - 1) # Could this epoch ever become -1 or something like that???
File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/d/student/Documents/software/wsl/isbnet/isbnet-master/isbnet-master/isbnet/model/isbnet.py", line 219, in forward
return self.forward_train(**batch, epoch=epoch)
File "/mnt/d/student/Documents/software/wsl/isbnet/isbnet-master/isbnet-master/isbnet/util/utils.py", line 172, in wrapper
return func(*new_args, **new_kwargs)
File "/mnt/d/student/Documents/software/wsl/isbnet/isbnet-master/isbnet-master/isbnet/model/isbnet.py", line 265, in forward_train
feats, coords_float, voxel_coords, spatial_shape, batch_size, p2v_map
File "/mnt/d/student/Documents/software/wsl/isbnet/isbnet-master/isbnet-master/isbnet/model/isbnet.py", line 632, in forward_backbone
output = self.unet(output)
File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/d/student/Documents/software/wsl/isbnet/isbnet-master/isbnet-master/isbnet/model/blocks.py", line 250, in forward
output_decoder = self.u(output_decoder)
File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/d/student/Documents/software/wsl/isbnet/isbnet-master/isbnet-master/isbnet/model/blocks.py", line 250, in forward
output_decoder = self.u(output_decoder)
File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/d/student/Documents/software/wsl/isbnet/isbnet-master/isbnet-master/isbnet/model/blocks.py", line 250, in forward
output_decoder = self.u(output_decoder)
File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/d/student/Documents/software/wsl/isbnet/isbnet-master/isbnet-master/isbnet/model/blocks.py", line 250, in forward
output_decoder = self.u(output_decoder)
File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/d/student/Documents/software/wsl/isbnet/isbnet-master/isbnet-master/isbnet/model/blocks.py", line 250, in forward
output_decoder = self.u(output_decoder)
File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/d/student/Documents/software/wsl/isbnet/isbnet-master/isbnet-master/isbnet/model/blocks.py", line 249, in forward
output_decoder = self.conv(output)
File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/spconv/pytorch/modules.py", line 137, in forward
input = module(input)
File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/spconv/pytorch/conv.py", line 404, in forward
raise e
File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/spconv/pytorch/conv.py", line 395, in forward
timer=input._timer)
File "/home/clinton/anaconda3/envs/isbnet/lib/python3.7/site-packages/spconv/pytorch/ops.py", line 465, in get_indice_pairs_implicit_gemm
stream_int=stream)
RuntimeError: /tmp/pip-build-env-a41g0q_q/overlay/lib/python3.7/site-packages/cumm/include/tensorview/cuda/launch.h(53)
N > 0 assert faild. CUDA kernel launch blocks must be positive, but got N= 0
I used batch_size=1, and I also avoided OOM by freezing all BatchNorm layers during training.
Any ideas about that? Thanks so much in advance!
You could check similar issues on the original repo of spconv: traveller59/spconv#406, mit-han-lab/bevfusion#82.
Best.
Those two cannot solve my problem. Any other advice?
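One thing that may be worth ruling out: the assertion says a CUDA kernel was asked to launch with N=0 blocks, which can happen when a sparse convolution has nothing to process. A minimal diagnostic sketch follows. It assumes the batch is a dict and that it carries the `voxel_coords` name seen in the traceback; the `scan_ids` key, `count_active_voxels`, and `train_step_or_skip` are hypothetical names used only for illustration. It logs and skips any batch with too few voxels, so the offending scan can be identified. This is not a confirmed fix for this issue.

```python
def count_active_voxels(batch, key="voxel_coords"):
    """Return the number of rows in batch[key], or 0 if the key is missing.

    len() gives the size of the first dimension for tensors and lists alike.
    The key name is an assumption taken from the traceback.
    """
    coords = batch.get(key)
    return 0 if coords is None else len(coords)


def train_step_or_skip(batch, step_fn, min_voxels=1, log=print):
    """Run step_fn(batch) unless the batch has fewer than min_voxels voxels.

    Returns whatever step_fn returns, or None when the batch is skipped.
    """
    n = count_active_voxels(batch)
    if n < min_voxels:
        # "scan_ids" is a hypothetical key; log whatever identifies the sample.
        log(f"skipping batch with {n} voxels (scan_ids={batch.get('scan_ids')})")
        return None
    return step_fn(batch)
```

In tools/train.py this would wrap the existing call, for example `train_step_or_skip(batch, lambda b: model(b, return_loss=True, epoch=epoch - 1))`, with the loop skipping the backward pass when the result is None. Raising `min_voxels` can also show whether very small crops are what triggers the failure deeper in the U-Net.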