CUDA Out-of-memory using V100
allanj opened this issue · 1 comments
allanj commented
I'm using V100 GPUs for my experiments, but training still runs out of memory partway through the first epoch (around iteration 1330 of 3696 in the log below). I'm not sure what the cause is at the moment.
Namespace(aux_loss=True, backbone='resnet50', batch_size=4, bbox_loss_coef=5, clip_max_norm=0.1, coco_panoptic_path=None, coco_path='./coco2017/', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=1024, dist_backend='nccl', dist_url='env://', distributed=True, dropout=0.1, enc_layers=6, eos_coef=0.1, epochs=300, eval=False, frozen_weights=None, giou_loss_coef=2, gpu=0, hidden_dim=256, lr=0.0005, lr_backbone=1e-05, lr_drop=200, mask_loss_coef=1, masks=False, nheads=8, num_queries=100, num_workers=2, output_dir='./output', position_embedding='sine', pre_norm=False, rank=0, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=8)
Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to /home/tiger/.cache/torch/hub/checkpoints/resnet50-19c8e357.pth
100%|██████████| 97.8M/97.8M [00:09<00:00, 10.3MB/s]
number of params: 36104659
loading annotations into memory...
Done (t=13.57s)
creating index...
index created!
loading annotations into memory...
Done (t=0.44s)
creating index...
index created!
Start training
Epoch: [0] [ 0/3696] eta: 2:32:25 lr: 0.000100 loss: 7.6000 (7.6000) at: 7.6000 (7.6000) at_unscaled: 7.6000 (7.6000) time: 2.4743 data: 0.5030 max mem: 14737
Epoch: [0] [ 10/3696] eta: 0:59:14 lr: 0.000100 loss: 7.5261 (7.5307) at: 7.5261 (7.5307) at_unscaled: 7.5261 (7.5307) time: 0.9643 data: 0.0806 max mem: 25656
Epoch: [0] [ 20/3696] eta: 0:56:49 lr: 0.000100 loss: 7.4746 (7.4774) at: 7.4746 (7.4774) at_unscaled: 7.4746 (7.4774) time: 0.8501 data: 0.0390 max mem: 25656
Epoch: [0] [ 30/3696] eta: 0:54:22 lr: 0.000100 loss: 7.3449 (7.4215) at: 7.3449 (7.4215) at_unscaled: 7.3449 (7.4215) time: 0.8489 data: 0.0374 max mem: 25656
Epoch: [0] [ 40/3696] eta: 0:54:59 lr: 0.000100 loss: 7.2054 (7.3429) at: 7.2054 (7.3429) at_unscaled: 7.2054 (7.3429) time: 0.8761 data: 0.0356 max mem: 25656
Epoch: [0] [ 50/3696] eta: 0:53:30 lr: 0.000100 loss: 7.0288 (7.2657) at: 7.0288 (7.2657) at_unscaled: 7.0288 (7.2657) time: 0.8662 data: 0.0362 max mem: 25656
Epoch: [0] [ 60/3696] eta: 0:53:44 lr: 0.000100 loss: 6.8423 (7.1774) at: 6.8423 (7.1774) at_unscaled: 6.8423 (7.1774) time: 0.8553 data: 0.0368 max mem: 26623
Epoch: [0] [ 70/3696] eta: 0:53:36 lr: 0.000100 loss: 6.6867 (7.0967) at: 6.6867 (7.0967) at_unscaled: 6.6867 (7.0967) time: 0.9036 data: 0.0359 max mem: 26623
Epoch: [0] [ 80/3696] eta: 0:52:42 lr: 0.000100 loss: 6.5043 (7.0184) at: 6.5043 (7.0184) at_unscaled: 6.5043 (7.0184) time: 0.8368 data: 0.0351 max mem: 26623
Epoch: [0] [ 90/3696] eta: 0:52:17 lr: 0.000100 loss: 6.4531 (6.9577) at: 6.4531 (6.9577) at_unscaled: 6.4531 (6.9577) time: 0.8094 data: 0.0362 max mem: 26623
Epoch: [0] [ 100/3696] eta: 0:51:33 lr: 0.000100 loss: 6.4151 (6.8982) at: 6.4151 (6.8982) at_unscaled: 6.4151 (6.8982) time: 0.8019 data: 0.0386 max mem: 26623
Epoch: [0] [ 110/3696] eta: 0:51:10 lr: 0.000100 loss: 6.3319 (6.8437) at: 6.3319 (6.8437) at_unscaled: 6.3319 (6.8437) time: 0.7937 data: 0.0392 max mem: 26623
Epoch: [0] [ 120/3696] eta: 0:50:56 lr: 0.000100 loss: 6.2714 (6.7969) at: 6.2714 (6.7969) at_unscaled: 6.2714 (6.7969) time: 0.8268 data: 0.0377 max mem: 26623
Epoch: [0] [ 130/3696] eta: 0:50:36 lr: 0.000100 loss: 6.2584 (6.7519) at: 6.2584 (6.7519) at_unscaled: 6.2584 (6.7519) time: 0.8254 data: 0.0372 max mem: 26623
Epoch: [0] [ 140/3696] eta: 0:50:25 lr: 0.000100 loss: 6.2035 (6.7111) at: 6.2035 (6.7111) at_unscaled: 6.2035 (6.7111) time: 0.8266 data: 0.0372 max mem: 29528
Epoch: [0] [ 150/3696] eta: 0:49:55 lr: 0.000100 loss: 6.1476 (6.6716) at: 6.1476 (6.6716) at_unscaled: 6.1476 (6.6716) time: 0.8011 data: 0.0375 max mem: 29528
Epoch: [0] [ 160/3696] eta: 0:49:27 lr: 0.000100 loss: 6.0711 (6.6330) at: 6.0711 (6.6330) at_unscaled: 6.0711 (6.6330) time: 0.7585 data: 0.0372 max mem: 29528
Epoch: [0] [ 170/3696] eta: 0:49:10 lr: 0.000100 loss: 6.0247 (6.5969) at: 6.0247 (6.5969) at_unscaled: 6.0247 (6.5969) time: 0.7769 data: 0.0358 max mem: 29528
Epoch: [0] [ 180/3696] eta: 0:49:27 lr: 0.000100 loss: 5.9822 (6.5631) at: 5.9822 (6.5631) at_unscaled: 5.9822 (6.5631) time: 0.8812 data: 0.0361 max mem: 29528
Epoch: [0] [ 190/3696] eta: 0:49:06 lr: 0.000100 loss: 5.9351 (6.5278) at: 5.9351 (6.5278) at_unscaled: 5.9351 (6.5278) time: 0.8712 data: 0.0371 max mem: 29528
Epoch: [0] [ 200/3696] eta: 0:48:45 lr: 0.000100 loss: 5.8904 (6.4953) at: 5.8904 (6.4953) at_unscaled: 5.8904 (6.4953) time: 0.7744 data: 0.0355 max mem: 29528
Epoch: [0] [ 210/3696] eta: 0:48:35 lr: 0.000100 loss: 5.8645 (6.4635) at: 5.8645 (6.4635) at_unscaled: 5.8645 (6.4635) time: 0.7968 data: 0.0348 max mem: 29528
Epoch: [0] [ 220/3696] eta: 0:48:17 lr: 0.000100 loss: 5.8032 (6.4343) at: 5.8032 (6.4343) at_unscaled: 5.8032 (6.4343) time: 0.7998 data: 0.0354 max mem: 29528
Epoch: [0] [ 230/3696] eta: 0:47:58 lr: 0.000100 loss: 5.7949 (6.4067) at: 5.7949 (6.4067) at_unscaled: 5.7949 (6.4067) time: 0.7687 data: 0.0362 max mem: 29528
Epoch: [0] [ 240/3696] eta: 0:47:45 lr: 0.000100 loss: 5.7568 (6.3776) at: 5.7568 (6.3776) at_unscaled: 5.7568 (6.3776) time: 0.7808 data: 0.0371 max mem: 29528
Epoch: [0] [ 250/3696] eta: 0:47:30 lr: 0.000100 loss: 5.7063 (6.3502) at: 5.7063 (6.3502) at_unscaled: 5.7063 (6.3502) time: 0.7889 data: 0.0366 max mem: 29528
Epoch: [0] [ 260/3696] eta: 0:47:11 lr: 0.000100 loss: 5.6821 (6.3225) at: 5.6821 (6.3225) at_unscaled: 5.6821 (6.3225) time: 0.7617 data: 0.0362 max mem: 29528
Epoch: [0] [ 270/3696] eta: 0:47:00 lr: 0.000100 loss: 5.6091 (6.2965) at: 5.6091 (6.2965) at_unscaled: 5.6091 (6.2965) time: 0.7725 data: 0.0366 max mem: 29528
Epoch: [0] [ 280/3696] eta: 0:46:48 lr: 0.000100 loss: 5.6024 (6.2713) at: 5.6024 (6.2713) at_unscaled: 5.6024 (6.2713) time: 0.7982 data: 0.0366 max mem: 29528
Epoch: [0] [ 290/3696] eta: 0:46:48 lr: 0.000100 loss: 5.5578 (6.2455) at: 5.5578 (6.2455) at_unscaled: 5.5578 (6.2455) time: 0.8433 data: 0.0370 max mem: 29528
Epoch: [0] [ 300/3696] eta: 0:46:36 lr: 0.000100 loss: 5.5396 (6.2221) at: 5.5396 (6.2221) at_unscaled: 5.5396 (6.2221) time: 0.8398 data: 0.0373 max mem: 29528
Epoch: [0] [ 310/3696] eta: 0:46:23 lr: 0.000100 loss: 5.5059 (6.1994) at: 5.5059 (6.1994) at_unscaled: 5.5059 (6.1994) time: 0.7842 data: 0.0374 max mem: 29528
Epoch: [0] [ 320/3696] eta: 0:46:12 lr: 0.000100 loss: 5.4888 (6.1767) at: 5.4888 (6.1767) at_unscaled: 5.4888 (6.1767) time: 0.7882 data: 0.0370 max mem: 29528
Epoch: [0] [ 330/3696] eta: 0:45:58 lr: 0.000100 loss: 5.4756 (6.1560) at: 5.4756 (6.1560) at_unscaled: 5.4756 (6.1560) time: 0.7820 data: 0.0365 max mem: 29528
Epoch: [0] [ 340/3696] eta: 0:45:49 lr: 0.000100 loss: 5.4458 (6.1354) at: 5.4458 (6.1354) at_unscaled: 5.4458 (6.1354) time: 0.7886 data: 0.0363 max mem: 29528
Epoch: [0] [ 350/3696] eta: 0:45:42 lr: 0.000100 loss: 5.4504 (6.1157) at: 5.4504 (6.1157) at_unscaled: 5.4504 (6.1157) time: 0.8230 data: 0.0364 max mem: 29528
Epoch: [0] [ 360/3696] eta: 0:45:34 lr: 0.000100 loss: 5.4683 (6.0973) at: 5.4683 (6.0973) at_unscaled: 5.4683 (6.0973) time: 0.8292 data: 0.0370 max mem: 29528
Epoch: [0] [ 370/3696] eta: 0:45:30 lr: 0.000100 loss: 5.4665 (6.0802) at: 5.4665 (6.0802) at_unscaled: 5.4665 (6.0802) time: 0.8410 data: 0.0357 max mem: 29528
Epoch: [0] [ 380/3696] eta: 0:45:22 lr: 0.000100 loss: 5.4943 (6.0647) at: 5.4943 (6.0647) at_unscaled: 5.4943 (6.0647) time: 0.8443 data: 0.0360 max mem: 29528
Epoch: [0] [ 390/3696] eta: 0:45:13 lr: 0.000100 loss: 5.4801 (6.0489) at: 5.4801 (6.0489) at_unscaled: 5.4801 (6.0489) time: 0.8209 data: 0.0371 max mem: 29528
Epoch: [0] [ 400/3696] eta: 0:45:14 lr: 0.000100 loss: 5.4442 (6.0338) at: 5.4442 (6.0338) at_unscaled: 5.4442 (6.0338) time: 0.8706 data: 0.0372 max mem: 29528
Epoch: [0] [ 410/3696] eta: 0:45:03 lr: 0.000100 loss: 5.4351 (6.0182) at: 5.4351 (6.0182) at_unscaled: 5.4351 (6.0182) time: 0.8613 data: 0.0376 max mem: 29528
Epoch: [0] [ 420/3696] eta: 0:44:50 lr: 0.000100 loss: 5.3845 (6.0028) at: 5.3845 (6.0028) at_unscaled: 5.3845 (6.0028) time: 0.7759 data: 0.0373 max mem: 29528
Epoch: [0] [ 430/3696] eta: 0:45:03 lr: 0.000100 loss: 5.3922 (5.9884) at: 5.3922 (5.9884) at_unscaled: 5.3922 (5.9884) time: 0.9318 data: 0.0361 max mem: 29528
Epoch: [0] [ 440/3696] eta: 0:44:50 lr: 0.000100 loss: 5.4115 (5.9759) at: 5.4115 (5.9759) at_unscaled: 5.4115 (5.9759) time: 0.9331 data: 0.0361 max mem: 29528
Epoch: [0] [ 450/3696] eta: 0:44:43 lr: 0.000100 loss: 5.4180 (5.9631) at: 5.4180 (5.9631) at_unscaled: 5.4180 (5.9631) time: 0.8017 data: 0.0359 max mem: 29528
Epoch: [0] [ 460/3696] eta: 0:44:29 lr: 0.000100 loss: 5.3881 (5.9501) at: 5.3881 (5.9501) at_unscaled: 5.3881 (5.9501) time: 0.7948 data: 0.0355 max mem: 29528
Epoch: [0] [ 470/3696] eta: 0:44:18 lr: 0.000100 loss: 5.3906 (5.9391) at: 5.3906 (5.9391) at_unscaled: 5.3906 (5.9391) time: 0.7668 data: 0.0371 max mem: 29528
Epoch: [0] [ 480/3696] eta: 0:44:10 lr: 0.000100 loss: 5.3906 (5.9277) at: 5.3906 (5.9277) at_unscaled: 5.3906 (5.9277) time: 0.8013 data: 0.0390 max mem: 29528
Epoch: [0] [ 490/3696] eta: 0:44:03 lr: 0.000100 loss: 5.4143 (5.9179) at: 5.4143 (5.9179) at_unscaled: 5.4143 (5.9179) time: 0.8300 data: 0.0391 max mem: 29528
Epoch: [0] [ 500/3696] eta: 0:43:54 lr: 0.000100 loss: 5.4093 (5.9075) at: 5.4093 (5.9075) at_unscaled: 5.4093 (5.9075) time: 0.8303 data: 0.0378 max mem: 29528
Epoch: [0] [ 510/3696] eta: 0:43:43 lr: 0.000100 loss: 5.3890 (5.8972) at: 5.3890 (5.8972) at_unscaled: 5.3890 (5.8972) time: 0.7958 data: 0.0367 max mem: 29528
Epoch: [0] [ 520/3696] eta: 0:43:31 lr: 0.000100 loss: 5.3959 (5.8872) at: 5.3959 (5.8872) at_unscaled: 5.3959 (5.8872) time: 0.7730 data: 0.0355 max mem: 29528
Epoch: [0] [ 530/3696] eta: 0:43:22 lr: 0.000100 loss: 5.3743 (5.8775) at: 5.3743 (5.8775) at_unscaled: 5.3743 (5.8775) time: 0.7915 data: 0.0358 max mem: 29528
Epoch: [0] [ 540/3696] eta: 0:43:12 lr: 0.000100 loss: 5.3725 (5.8675) at: 5.3725 (5.8675) at_unscaled: 5.3725 (5.8675) time: 0.8013 data: 0.0355 max mem: 29528
Epoch: [0] [ 550/3696] eta: 0:43:02 lr: 0.000100 loss: 5.3403 (5.8580) at: 5.3403 (5.8580) at_unscaled: 5.3403 (5.8580) time: 0.7922 data: 0.0349 max mem: 29528
Epoch: [0] [ 560/3696] eta: 0:42:52 lr: 0.000100 loss: 5.3460 (5.8494) at: 5.3460 (5.8494) at_unscaled: 5.3460 (5.8494) time: 0.7893 data: 0.0355 max mem: 29528
Epoch: [0] [ 570/3696] eta: 0:42:43 lr: 0.000100 loss: 5.3509 (5.8408) at: 5.3509 (5.8408) at_unscaled: 5.3509 (5.8408) time: 0.7901 data: 0.0359 max mem: 29528
Epoch: [0] [ 580/3696] eta: 0:42:31 lr: 0.000100 loss: 5.3509 (5.8328) at: 5.3509 (5.8328) at_unscaled: 5.3509 (5.8328) time: 0.7762 data: 0.0358 max mem: 29528
Epoch: [0] [ 590/3696] eta: 0:42:22 lr: 0.000100 loss: 5.3572 (5.8243) at: 5.3572 (5.8243) at_unscaled: 5.3572 (5.8243) time: 0.7785 data: 0.0351 max mem: 29528
Epoch: [0] [ 600/3696] eta: 0:42:11 lr: 0.000100 loss: 5.3541 (5.8163) at: 5.3541 (5.8163) at_unscaled: 5.3541 (5.8163) time: 0.7857 data: 0.0343 max mem: 29528
Epoch: [0] [ 610/3696] eta: 0:41:59 lr: 0.000100 loss: 5.3445 (5.8085) at: 5.3445 (5.8085) at_unscaled: 5.3445 (5.8085) time: 0.7585 data: 0.0351 max mem: 29528
Epoch: [0] [ 620/3696] eta: 0:41:54 lr: 0.000100 loss: 5.3499 (5.8015) at: 5.3499 (5.8015) at_unscaled: 5.3499 (5.8015) time: 0.8055 data: 0.0354 max mem: 29528
Epoch: [0] [ 630/3696] eta: 0:41:42 lr: 0.000100 loss: 5.3499 (5.7940) at: 5.3499 (5.7940) at_unscaled: 5.3499 (5.7940) time: 0.8031 data: 0.0343 max mem: 29528
Epoch: [0] [ 640/3696] eta: 0:41:31 lr: 0.000100 loss: 5.3273 (5.7865) at: 5.3273 (5.7865) at_unscaled: 5.3273 (5.7865) time: 0.7553 data: 0.0356 max mem: 29528
Epoch: [0] [ 650/3696] eta: 0:41:22 lr: 0.000100 loss: 5.3314 (5.7792) at: 5.3314 (5.7792) at_unscaled: 5.3314 (5.7792) time: 0.7825 data: 0.0378 max mem: 29528
Epoch: [0] [ 660/3696] eta: 0:41:16 lr: 0.000100 loss: 5.3259 (5.7719) at: 5.3259 (5.7719) at_unscaled: 5.3259 (5.7719) time: 0.8199 data: 0.0371 max mem: 29528
Epoch: [0] [ 670/3696] eta: 0:41:06 lr: 0.000100 loss: 5.2930 (5.7651) at: 5.2930 (5.7651) at_unscaled: 5.2930 (5.7651) time: 0.8170 data: 0.0351 max mem: 29528
Epoch: [0] [ 680/3696] eta: 0:40:57 lr: 0.000100 loss: 5.2930 (5.7582) at: 5.2930 (5.7582) at_unscaled: 5.2930 (5.7582) time: 0.7851 data: 0.0354 max mem: 29528
Epoch: [0] [ 690/3696] eta: 0:40:49 lr: 0.000100 loss: 5.2727 (5.7514) at: 5.2727 (5.7514) at_unscaled: 5.2727 (5.7514) time: 0.8068 data: 0.0353 max mem: 29528
Epoch: [0] [ 700/3696] eta: 0:40:41 lr: 0.000100 loss: 5.2917 (5.7451) at: 5.2917 (5.7451) at_unscaled: 5.2917 (5.7451) time: 0.8184 data: 0.0348 max mem: 29528
Epoch: [0] [ 710/3696] eta: 0:40:31 lr: 0.000100 loss: 5.2949 (5.7387) at: 5.2949 (5.7387) at_unscaled: 5.2949 (5.7387) time: 0.7904 data: 0.0358 max mem: 29528
Epoch: [0] [ 720/3696] eta: 0:40:21 lr: 0.000100 loss: 5.2874 (5.7325) at: 5.2874 (5.7325) at_unscaled: 5.2874 (5.7325) time: 0.7719 data: 0.0376 max mem: 29528
Epoch: [0] [ 730/3696] eta: 0:40:10 lr: 0.000100 loss: 5.2801 (5.7262) at: 5.2801 (5.7262) at_unscaled: 5.2801 (5.7262) time: 0.7581 data: 0.0372 max mem: 29528
Epoch: [0] [ 740/3696] eta: 0:40:02 lr: 0.000100 loss: 5.2634 (5.7196) at: 5.2634 (5.7196) at_unscaled: 5.2634 (5.7196) time: 0.7769 data: 0.0357 max mem: 29528
Epoch: [0] [ 750/3696] eta: 0:39:53 lr: 0.000100 loss: 5.2367 (5.7135) at: 5.2367 (5.7135) at_unscaled: 5.2367 (5.7135) time: 0.8039 data: 0.0365 max mem: 29528
Epoch: [0] [ 760/3696] eta: 0:39:43 lr: 0.000100 loss: 5.2874 (5.7082) at: 5.2874 (5.7082) at_unscaled: 5.2874 (5.7082) time: 0.7800 data: 0.0367 max mem: 29528
Epoch: [0] [ 770/3696] eta: 0:39:33 lr: 0.000100 loss: 5.2954 (5.7024) at: 5.2954 (5.7024) at_unscaled: 5.2954 (5.7024) time: 0.7681 data: 0.0356 max mem: 29528
Epoch: [0] [ 780/3696] eta: 0:39:23 lr: 0.000100 loss: 5.3127 (5.6975) at: 5.3127 (5.6975) at_unscaled: 5.3127 (5.6975) time: 0.7632 data: 0.0361 max mem: 29528
Epoch: [0] [ 790/3696] eta: 0:39:14 lr: 0.000100 loss: 5.3130 (5.6919) at: 5.3130 (5.6919) at_unscaled: 5.3130 (5.6919) time: 0.7715 data: 0.0359 max mem: 29528
Epoch: [0] [ 800/3696] eta: 0:39:06 lr: 0.000100 loss: 5.2498 (5.6860) at: 5.2498 (5.6860) at_unscaled: 5.2498 (5.6860) time: 0.7954 data: 0.0369 max mem: 29528
Epoch: [0] [ 810/3696] eta: 0:38:58 lr: 0.000100 loss: 5.2336 (5.6804) at: 5.2336 (5.6804) at_unscaled: 5.2336 (5.6804) time: 0.8095 data: 0.0380 max mem: 29528
Epoch: [0] [ 820/3696] eta: 0:38:50 lr: 0.000100 loss: 5.2354 (5.6755) at: 5.2354 (5.6755) at_unscaled: 5.2354 (5.6755) time: 0.8130 data: 0.0356 max mem: 29528
Epoch: [0] [ 830/3696] eta: 0:38:39 lr: 0.000100 loss: 5.2691 (5.6704) at: 5.2691 (5.6704) at_unscaled: 5.2691 (5.6704) time: 0.7757 data: 0.0355 max mem: 29528
Epoch: [0] [ 840/3696] eta: 0:38:31 lr: 0.000100 loss: 5.2588 (5.6653) at: 5.2588 (5.6653) at_unscaled: 5.2588 (5.6653) time: 0.7692 data: 0.0369 max mem: 29528
Epoch: [0] [ 850/3696] eta: 0:38:23 lr: 0.000100 loss: 5.2564 (5.6606) at: 5.2564 (5.6606) at_unscaled: 5.2564 (5.6606) time: 0.8133 data: 0.0363 max mem: 29528
Epoch: [0] [ 860/3696] eta: 0:38:15 lr: 0.000100 loss: 5.2448 (5.6556) at: 5.2448 (5.6556) at_unscaled: 5.2448 (5.6556) time: 0.8129 data: 0.0352 max mem: 29528
Epoch: [0] [ 870/3696] eta: 0:38:05 lr: 0.000100 loss: 5.2326 (5.6506) at: 5.2326 (5.6506) at_unscaled: 5.2326 (5.6506) time: 0.7795 data: 0.0351 max mem: 29528
Epoch: [0] [ 880/3696] eta: 0:37:56 lr: 0.000100 loss: 5.2049 (5.6456) at: 5.2049 (5.6456) at_unscaled: 5.2049 (5.6456) time: 0.7750 data: 0.0364 max mem: 29528
Epoch: [0] [ 890/3696] eta: 0:37:47 lr: 0.000100 loss: 5.2049 (5.6407) at: 5.2049 (5.6407) at_unscaled: 5.2049 (5.6407) time: 0.7812 data: 0.0367 max mem: 29528
Epoch: [0] [ 900/3696] eta: 0:37:37 lr: 0.000100 loss: 5.1690 (5.6354) at: 5.1690 (5.6354) at_unscaled: 5.1690 (5.6354) time: 0.7607 data: 0.0348 max mem: 29528
Epoch: [0] [ 910/3696] eta: 0:37:31 lr: 0.000100 loss: 5.1836 (5.6309) at: 5.1836 (5.6309) at_unscaled: 5.1836 (5.6309) time: 0.8035 data: 0.0355 max mem: 29528
Epoch: [0] [ 920/3696] eta: 0:37:22 lr: 0.000100 loss: 5.2129 (5.6261) at: 5.2129 (5.6261) at_unscaled: 5.2129 (5.6261) time: 0.8221 data: 0.0381 max mem: 29528
Epoch: [0] [ 930/3696] eta: 0:37:13 lr: 0.000100 loss: 5.1586 (5.6210) at: 5.1586 (5.6210) at_unscaled: 5.1586 (5.6210) time: 0.7758 data: 0.0377 max mem: 29528
Epoch: [0] [ 940/3696] eta: 0:37:05 lr: 0.000100 loss: 5.1586 (5.6162) at: 5.1586 (5.6162) at_unscaled: 5.1586 (5.6162) time: 0.7975 data: 0.0355 max mem: 29528
Epoch: [0] [ 950/3696] eta: 0:36:56 lr: 0.000100 loss: 5.1713 (5.6120) at: 5.1713 (5.6120) at_unscaled: 5.1713 (5.6120) time: 0.7970 data: 0.0358 max mem: 29528
Epoch: [0] [ 960/3696] eta: 0:36:47 lr: 0.000100 loss: 5.1839 (5.6077) at: 5.1839 (5.6077) at_unscaled: 5.1839 (5.6077) time: 0.7714 data: 0.0367 max mem: 29528
Epoch: [0] [ 970/3696] eta: 0:36:38 lr: 0.000100 loss: 5.1800 (5.6036) at: 5.1800 (5.6036) at_unscaled: 5.1800 (5.6036) time: 0.7812 data: 0.0363 max mem: 29528
Epoch: [0] [ 980/3696] eta: 0:36:30 lr: 0.000100 loss: 5.2028 (5.5995) at: 5.2028 (5.5995) at_unscaled: 5.2028 (5.5995) time: 0.7996 data: 0.0349 max mem: 29528
Epoch: [0] [ 990/3696] eta: 0:36:23 lr: 0.000100 loss: 5.2028 (5.5954) at: 5.2028 (5.5954) at_unscaled: 5.2028 (5.5954) time: 0.8110 data: 0.0353 max mem: 29528
Epoch: [0] [1000/3696] eta: 0:36:14 lr: 0.000100 loss: 5.1880 (5.5914) at: 5.1880 (5.5914) at_unscaled: 5.1880 (5.5914) time: 0.7950 data: 0.0369 max mem: 29528
Epoch: [0] [1010/3696] eta: 0:36:04 lr: 0.000100 loss: 5.1773 (5.5870) at: 5.1773 (5.5870) at_unscaled: 5.1773 (5.5870) time: 0.7645 data: 0.0368 max mem: 29528
Epoch: [0] [1020/3696] eta: 0:35:57 lr: 0.000100 loss: 5.2493 (5.5836) at: 5.2493 (5.5836) at_unscaled: 5.2493 (5.5836) time: 0.7915 data: 0.0360 max mem: 29528
Epoch: [0] [1030/3696] eta: 0:35:49 lr: 0.000100 loss: 5.1982 (5.5793) at: 5.1982 (5.5793) at_unscaled: 5.1982 (5.5793) time: 0.8164 data: 0.0363 max mem: 29528
Epoch: [0] [1040/3696] eta: 0:35:41 lr: 0.000100 loss: 5.1446 (5.5754) at: 5.1446 (5.5754) at_unscaled: 5.1446 (5.5754) time: 0.8053 data: 0.0375 max mem: 29528
Epoch: [0] [1050/3696] eta: 0:35:31 lr: 0.000100 loss: 5.1319 (5.5714) at: 5.1319 (5.5714) at_unscaled: 5.1319 (5.5714) time: 0.7766 data: 0.0359 max mem: 29528
Epoch: [0] [1060/3696] eta: 0:35:22 lr: 0.000100 loss: 5.2017 (5.5679) at: 5.2017 (5.5679) at_unscaled: 5.2017 (5.5679) time: 0.7481 data: 0.0365 max mem: 29528
Epoch: [0] [1070/3696] eta: 0:35:13 lr: 0.000100 loss: 5.2017 (5.5642) at: 5.2017 (5.5642) at_unscaled: 5.2017 (5.5642) time: 0.7754 data: 0.0387 max mem: 29528
Epoch: [0] [1080/3696] eta: 0:35:03 lr: 0.000100 loss: 5.1192 (5.5603) at: 5.1192 (5.5603) at_unscaled: 5.1192 (5.5603) time: 0.7605 data: 0.0383 max mem: 29528
Epoch: [0] [1090/3696] eta: 0:34:56 lr: 0.000100 loss: 5.1105 (5.5560) at: 5.1105 (5.5560) at_unscaled: 5.1105 (5.5560) time: 0.7700 data: 0.0379 max mem: 29528
Epoch: [0] [1100/3696] eta: 0:34:47 lr: 0.000100 loss: 5.1321 (5.5524) at: 5.1321 (5.5524) at_unscaled: 5.1321 (5.5524) time: 0.8007 data: 0.0380 max mem: 29528
Epoch: [0] [1110/3696] eta: 0:34:39 lr: 0.000100 loss: 5.1603 (5.5489) at: 5.1603 (5.5489) at_unscaled: 5.1603 (5.5489) time: 0.7850 data: 0.0382 max mem: 29528
Epoch: [0] [1120/3696] eta: 0:34:30 lr: 0.000100 loss: 5.1443 (5.5452) at: 5.1443 (5.5452) at_unscaled: 5.1443 (5.5452) time: 0.7765 data: 0.0383 max mem: 29528
Epoch: [0] [1130/3696] eta: 0:34:21 lr: 0.000100 loss: 5.1185 (5.5413) at: 5.1185 (5.5413) at_unscaled: 5.1185 (5.5413) time: 0.7790 data: 0.0372 max mem: 29528
Epoch: [0] [1140/3696] eta: 0:34:13 lr: 0.000100 loss: 5.0800 (5.5374) at: 5.0800 (5.5374) at_unscaled: 5.0800 (5.5374) time: 0.7986 data: 0.0356 max mem: 29528
Epoch: [0] [1150/3696] eta: 0:34:04 lr: 0.000100 loss: 5.1101 (5.5337) at: 5.1101 (5.5337) at_unscaled: 5.1101 (5.5337) time: 0.7654 data: 0.0345 max mem: 29528
Epoch: [0] [1160/3696] eta: 0:33:56 lr: 0.000100 loss: 5.1744 (5.5307) at: 5.1744 (5.5307) at_unscaled: 5.1744 (5.5307) time: 0.7695 data: 0.0344 max mem: 29528
Epoch: [0] [1170/3696] eta: 0:33:47 lr: 0.000100 loss: 5.1829 (5.5277) at: 5.1829 (5.5277) at_unscaled: 5.1829 (5.5277) time: 0.7968 data: 0.0362 max mem: 29528
Epoch: [0] [1180/3696] eta: 0:33:40 lr: 0.000100 loss: 5.1845 (5.5246) at: 5.1845 (5.5246) at_unscaled: 5.1845 (5.5246) time: 0.8120 data: 0.0374 max mem: 29528
Epoch: [0] [1190/3696] eta: 0:33:32 lr: 0.000100 loss: 5.1798 (5.5216) at: 5.1798 (5.5216) at_unscaled: 5.1798 (5.5216) time: 0.8169 data: 0.0371 max mem: 29528
Epoch: [0] [1200/3696] eta: 0:33:23 lr: 0.000100 loss: 5.1929 (5.5188) at: 5.1929 (5.5188) at_unscaled: 5.1929 (5.5188) time: 0.7739 data: 0.0361 max mem: 29528
Epoch: [0] [1210/3696] eta: 0:33:16 lr: 0.000100 loss: 5.1929 (5.5158) at: 5.1929 (5.5158) at_unscaled: 5.1929 (5.5158) time: 0.7985 data: 0.0340 max mem: 29528
Epoch: [0] [1220/3696] eta: 0:33:07 lr: 0.000100 loss: 5.1322 (5.5126) at: 5.1322 (5.5126) at_unscaled: 5.1322 (5.5126) time: 0.8027 data: 0.0350 max mem: 29528
Epoch: [0] [1230/3696] eta: 0:32:59 lr: 0.000100 loss: 5.1595 (5.5096) at: 5.1595 (5.5096) at_unscaled: 5.1595 (5.5096) time: 0.7881 data: 0.0374 max mem: 29528
Epoch: [0] [1240/3696] eta: 0:32:50 lr: 0.000100 loss: 5.1620 (5.5067) at: 5.1620 (5.5067) at_unscaled: 5.1620 (5.5067) time: 0.7849 data: 0.0365 max mem: 29528
Epoch: [0] [1250/3696] eta: 0:32:42 lr: 0.000100 loss: 5.1620 (5.5038) at: 5.1620 (5.5038) at_unscaled: 5.1620 (5.5038) time: 0.7893 data: 0.0357 max mem: 29528
Epoch: [0] [1260/3696] eta: 0:32:34 lr: 0.000100 loss: 5.1245 (5.5005) at: 5.1245 (5.5005) at_unscaled: 5.1245 (5.5005) time: 0.8002 data: 0.0359 max mem: 29528
Epoch: [0] [1270/3696] eta: 0:32:26 lr: 0.000100 loss: 5.1023 (5.4975) at: 5.1023 (5.4975) at_unscaled: 5.1023 (5.4975) time: 0.8015 data: 0.0362 max mem: 29528
Epoch: [0] [1280/3696] eta: 0:32:17 lr: 0.000100 loss: 5.1132 (5.4946) at: 5.1132 (5.4946) at_unscaled: 5.1132 (5.4946) time: 0.7906 data: 0.0349 max mem: 29528
Epoch: [0] [1290/3696] eta: 0:32:09 lr: 0.000100 loss: 5.1292 (5.4918) at: 5.1292 (5.4918) at_unscaled: 5.1292 (5.4918) time: 0.7743 data: 0.0334 max mem: 29528
Epoch: [0] [1300/3696] eta: 0:32:01 lr: 0.000100 loss: 5.1292 (5.4890) at: 5.1292 (5.4890) at_unscaled: 5.1292 (5.4890) time: 0.7875 data: 0.0339 max mem: 29528
Epoch: [0] [1310/3696] eta: 0:31:54 lr: 0.000100 loss: 5.1232 (5.4863) at: 5.1232 (5.4863) at_unscaled: 5.1232 (5.4863) time: 0.8117 data: 0.0343 max mem: 29528
Epoch: [0] [1320/3696] eta: 0:31:45 lr: 0.000100 loss: 5.1016 (5.4832) at: 5.1016 (5.4832) at_unscaled: 5.1016 (5.4832) time: 0.8161 data: 0.0341 max mem: 29528
Epoch: [0] [1330/3696] eta: 0:31:38 lr: 0.000100 loss: 5.0905 (5.4805) at: 5.0905 (5.4805) at_unscaled: 5.0905 (5.4805) time: 0.8149 data: 0.0343 max mem: 29528
Traceback (most recent call last):
File "main.py", line 257, in <module>
main(args)
File "main.py", line 207, in main
args.clip_max_norm, learning_rate_schedule)
File "/opt/tiger/intro/Stable-Pix2Seq/engine.py", line 98, in train_one_epoch
losses.backward()
File "/home/tiger/.local/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/tiger/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 216.00 MiB (GPU 7; 31.75 GiB total capacity; 29.63 GiB already allocated; 213.75 MiB free; 29.95 GiB reserved in total by PyTorch)
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/tiger/.local/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/home/tiger/.local/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/tiger/.local/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'main.py', '--coco_path', './coco2017/', '--batch_size', '4', '--lr', '0.0005', '--output_dir', './output']' returned non-zero exit status 1.
Killing subprocess 5627
Killing subprocess 5628
Killing subprocess 5629
Killing subprocess 5630
Killing subprocess 5631
Killing subprocess 5632
Killing subprocess 5633
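To make the memory pattern in this log easier to see, here is a minimal sketch that pulls out the iterations at which the reported "max mem" peak increased, and how far the failed allocation was from fitting. The function names and the regular expressions are my own, not part of Stable-Pix2Seq. The sample lines are abbreviated copies of lines from the log above (loss fields omitted), and I am assuming "max mem" is reported in MiB.

```python
import re

# "[ 140/3696] ... max mem: 29528" -> iteration 140, peak 29528
LOG_LINE = re.compile(r"\[\s*(\d+)/\d+\].*max mem:\s*(\d+)")
# Assumes both figures in the OOM message are given in MiB, as they are here.
OOM_LINE = re.compile(r"Tried to allocate ([\d.]+) MiB.*?([\d.]+) MiB free")


def peak_memory_steps(lines):
    """Return (iteration, max_mem) pairs where the reported peak went up."""
    steps = []
    last = None
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # not a training progress line
        iteration, mem = int(match.group(1)), int(match.group(2))
        if last is None or mem > last:
            steps.append((iteration, mem))
            last = mem
    return steps


def oom_shortfall_mib(message):
    """Return requested minus free memory (MiB) from a CUDA OOM message."""
    match = OOM_LINE.search(message)
    if match is None:
        raise ValueError("not a recognised CUDA OOM message")
    return float(match.group(1)) - float(match.group(2))


# Abbreviated copies of lines from the log above (loss fields omitted).
SAMPLE_LOG = [
    "Epoch: [0] [ 0/3696] eta: 2:32:25 lr: 0.000100 max mem: 14737",
    "Epoch: [0] [ 10/3696] eta: 0:59:14 lr: 0.000100 max mem: 25656",
    "Epoch: [0] [ 20/3696] eta: 0:56:49 lr: 0.000100 max mem: 25656",
    "Epoch: [0] [ 60/3696] eta: 0:53:44 lr: 0.000100 max mem: 26623",
    "Epoch: [0] [ 140/3696] eta: 0:50:25 lr: 0.000100 max mem: 29528",
    "Epoch: [0] [1330/3696] eta: 0:31:38 lr: 0.000100 max mem: 29528",
]
OOM_MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 216.00 MiB "
    "(GPU 7; 31.75 GiB total capacity; 29.63 GiB already allocated; "
    "213.75 MiB free; 29.95 GiB reserved in total by PyTorch)"
)

print(peak_memory_steps(SAMPLE_LOG))
print(oom_shortfall_mib(OOM_MESSAGE))
```

On this log the peak rises in a few discrete steps (at iterations 0, 10, 60 and 140) and then stays flat until the crash, and the failed 216 MiB request missed the free memory by only a couple of MiB. That looks more like an occasional larger batch than a steady leak, but that reading is a guess on my part and is not confirmed anywhere in this thread. Note also that the logged peak may come from a different process than GPU 7, which is the one that failed.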
allanj commented
Changing batch_size from 4 to 3 works for me, though. 😞
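For reference, a rough sketch of what that change does to the overall batch, assuming batch_size is the per-GPU value. The 3696 iterations per epoch in the log are consistent with that reading for 8 processes, if the dataset is COCO train2017 with 118,287 images and the incomplete final batch is dropped; both of those are my assumptions, and the helper names are made up for illustration.

```python
WORLD_SIZE = 8          # world_size=8 in the Namespace dump above
DATASET_SIZE = 118_287  # assumption: COCO train2017 image count


def global_batch_size(per_gpu_batch_size, world_size=WORLD_SIZE):
    """Images processed per optimizer step across all processes."""
    return per_gpu_batch_size * world_size


def iterations_per_epoch(per_gpu_batch_size,
                         dataset_size=DATASET_SIZE,
                         world_size=WORLD_SIZE):
    """Steps per epoch, assuming the incomplete final batch is dropped."""
    return dataset_size // global_batch_size(per_gpu_batch_size, world_size)


for per_gpu in (4, 3):
    print(per_gpu, global_batch_size(per_gpu), iterations_per_epoch(per_gpu))
```

Under those assumptions the global batch drops from 32 to 24 images per step, and an epoch gets correspondingly longer. Whether lr=0.0005 should be adjusted for the smaller batch is not something this thread settles.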