Error executing job with overrides
bai-24 opened this issue · 1 comment
Dear Author,
I encountered the following error during validation, after the first training epoch finished:
Epoch 0 - train: 100%|▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒| 566435/566435 [40:02:24<00:00, 3.93it/s, loss=3.45]
Epoch 0 - validation: 0%| | 2/1563 [00:04<55:13, 2.12s/it, loss=3.78]
Error executing job with overrides: []
Traceback (most recent call last):
File "train_demo.py", line 324, in run_main
main(config)
File "train_demo.py", line 174, in main
train_res = train_xe(
File "J:\GRIT\engine\caption_engine.py", line 383, in train_xe
val_loss = evaluate_loss(model, dataloaders['valid'], loss_fn, text_field, epoch, writer)
File "J:\GRIT\engine\caption_engine.py", line 298, in evaluate_loss
out = model(batch['samples'], batch['captions'])
File "D:\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "J:\GRIT\models\caption\transformer.py", line 89, in forward
vis_inputs = self.detector(images)
File "D:\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "J:\GRIT\models\caption\detector.py", line 53, in forward
features = self.backbone(x)
File "D:\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "J:\GRIT\models\common\swin_model.py", line 662, in forward
x_out, H, W, x, Wh, Ww = layer(x, Wh, Ww)
File "D:\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "J:\GRIT\models\common\swin_model.py", line 448, in forward
x = blk(x, attn_mask)
File "D:\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "J:\GRIT\models\common\swin_model.py", line 279, in forward
attn_windows = self.attn(x_windows, mask=attn_mask) # nW*B, window_size*window_size, C
File "D:\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "J:\GRIT\models\common\swin_model.py", line 171, in forward
attn = attn + relative_position_bias.unsqueeze(0)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 568.00 MiB (GPU 0; 6.00 GiB total capacity; 3.99 GiB already allocated; 0 bytes free; 4.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Is there any way to solve it?
It looks like your GPU does not have enough memory: the error reports 6.00 GiB total capacity with 0 bytes free. I think you can reduce the batch size in the config:
grit/configs/caption/coco_config.yaml
Line 77 in 0e63d6a
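
To see why a smaller batch is the likely fix, here is a small sketch that pulls the sizes out of the OOM message above and does the arithmetic. `parse_oom` and the other names are hypothetical helpers written for this illustration, not part of GRIT or PyTorch, and the script uses only the Python standard library.

```python
import re

# Text copied from the OOM message in the traceback above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 568.00 MiB "
    "(GPU 0; 6.00 GiB total capacity; 3.99 GiB already allocated; "
    "0 bytes free; 4.77 GiB reserved in total by PyTorch)"
)

# Conversion factors from the units used in the message to MiB.
UNIT_TO_MIB = {
    "bytes": 1 / (1024 * 1024),
    "KiB": 1 / 1024,
    "MiB": 1.0,
    "GiB": 1024.0,
}


def to_mib(value, unit):
    """Convert a (number, unit) pair from the message into MiB."""
    return float(value) * UNIT_TO_MIB[unit]


def parse_oom(message):
    """Hypothetical helper: extract the sizes from an OOM message, in MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (\w+)",
        "total": r"([\d.]+) (\w+) total capacity",
        "allocated": r"([\d.]+) (\w+) already allocated",
        "reserved": r"([\d.]+) (\w+) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            sizes[name] = to_mib(match.group(1), match.group(2))
    return sizes


sizes = parse_oom(OOM_MESSAGE)

# Memory PyTorch has reserved from the driver but is not currently using.
cached_but_unused = sizes["reserved"] - sizes["allocated"]
# Memory on the card that PyTorch has not reserved at all.
not_reserved = sizes["total"] - sizes["reserved"]

print(f"requested by the failing op : {sizes['requested']:.2f} MiB")
print(f"reserved but not allocated  : {cached_but_unused:.2f} MiB")
print(f"not reserved by PyTorch     : {not_reserved:.2f} MiB")
```

By this arithmetic PyTorch holds roughly 0.8 GiB of reserved-but-unallocated memory, a bit more than the 568 MiB request, so fragmentation may be contributing and the `max_split_size_mb` setting mentioned in the error could be worth a try. Around 1.2 GiB of the 6 GiB card is not reserved by PyTorch at all, though, so lowering the batch size is probably the more reliable fix.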