Vegeta2020/SE-SSD

RuntimeError: Dataloader worker (pid(s) 19435) exited unexpectedly

Carl12138aka opened this issue · 1 comment

```
2021-11-24 05:56:46,311 - INFO - Epoch [31/60][210/928] lr: 0.00278, eta: 11:25:38, time: 1.490, data_time: 0.018, transfer_time: 0.017, forward_time: 0.670, loss_parse_time: 0.000 memory: 3703,
2021-11-24 05:56:46,311 - INFO - task : ['Car'], loss: 0.8918, cls_loss_reduced: 0.1903, loc_loss_reduced: 0.2803, dir_loss_reduced: 0.0349, iou_pred_loss: 0.0804, consistency_loss: 0.0556, loc_loss_elem: ['0.0058', '0.0053', '0.0284', '0.0174', '0.0239', '0.0216', '0.0377'], cls_pos_loss: 0.1345, cls_neg_loss: 0.0558, ious_loss: 0.5306, num_pos: 72.9000, num_neg: 70218.3000, loss_ema: 0.2507, cls_loss_reduced_ema: 0.1573, loc_loss_reduced_ema: 0.2217, dir_loss_reduced_ema: 0.0244, iou_pred_loss_ema: 0.0690, loc_loss_elem_ema: ['0.0065', '0.0033', '0.0201', '0.0152', '0.0205', '0.0194', '0.0259'], cls_pos_loss_ema: 0.1030, cls_neg_loss_ema: 0.0544, num_pos_ema: 73.0000, num_neg_ema: 70216.3000

2021-11-24 05:57:01,158 - INFO - Epoch [31/60][220/928] lr: 0.00278, eta: 11:25:24, time: 1.485, data_time: 0.018, transfer_time: 0.017, forward_time: 0.681, loss_parse_time: 0.000 memory: 3703,
2021-11-24 05:57:01,158 - INFO - task : ['Car'], loss: 0.9807, cls_loss_reduced: 0.2094, loc_loss_reduced: 0.3321, dir_loss_reduced: 0.0484, iou_pred_loss: 0.0874, consistency_loss: 0.0610, loc_loss_elem: ['0.0071', '0.0057', '0.0318', '0.0202', '0.0308', '0.0267', '0.0437'], cls_pos_loss: 0.1559, cls_neg_loss: 0.0535, ious_loss: 0.5745, num_pos: 72.0000, num_neg: 70218.4000, loss_ema: 0.2532, cls_loss_reduced_ema: 0.1520, loc_loss_reduced_ema: 0.2008, dir_loss_reduced_ema: 0.0270, iou_pred_loss_ema: 0.0741, loc_loss_elem_ema: ['0.0059', '0.0033', '0.0217', '0.0140', '0.0191', '0.0182', '0.0182'], cls_pos_loss_ema: 0.0948, cls_neg_loss_ema: 0.0572, num_pos_ema: 77.2000, num_neg_ema: 70205.4000

2021-11-24 05:57:15,839 - INFO - Epoch [31/60][230/928] lr: 0.00278, eta: 11:25:08, time: 1.468, data_time: 0.020, transfer_time: 0.017, forward_time: 0.656, loss_parse_time: 0.000 memory: 3703,
2021-11-24 05:57:15,840 - INFO - task : ['Car'], loss: 0.8810, cls_loss_reduced: 0.1796, loc_loss_reduced: 0.2764, dir_loss_reduced: 0.0373, iou_pred_loss: 0.0820, consistency_loss: 0.0527, loc_loss_elem: ['0.0066', '0.0043', '0.0249', '0.0166', '0.0255', '0.0267', '0.0336'], cls_pos_loss: 0.1226, cls_neg_loss: 0.0569, ious_loss: 0.5294, num_pos: 69.8000, num_neg: 70222.8000, loss_ema: 0.2269, cls_loss_reduced_ema: 0.1348, loc_loss_reduced_ema: 0.1698, dir_loss_reduced_ema: 0.0238, iou_pred_loss_ema: 0.0682, loc_loss_elem_ema: ['0.0053', '0.0025', '0.0139', '0.0121', '0.0186', '0.0193', '0.0132'], cls_pos_loss_ema: 0.0787, cls_neg_loss_ema: 0.0561, num_pos_ema: 72.3000, num_neg_ema: 70217.5000

2021-11-24 05:57:30,428 - INFO - Epoch [31/60][240/928] lr: 0.00278, eta: 11:24:53, time: 1.459, data_time: 0.018, transfer_time: 0.017, forward_time: 0.661, loss_parse_time: 0.000 memory: 3703,
2021-11-24 05:57:30,428 - INFO - task : ['Car'], loss: 0.9554, cls_loss_reduced: 0.2016, loc_loss_reduced: 0.3169, dir_loss_reduced: 0.0402, iou_pred_loss: 0.0874, consistency_loss: 0.0569, loc_loss_elem: ['0.0068', '0.0049', '0.0294', '0.0192', '0.0280', '0.0259', '0.0441'], cls_pos_loss: 0.1453, cls_neg_loss: 0.0564, ious_loss: 0.5693, num_pos: 73.5000, num_neg: 70216.1000, loss_ema: 0.2422, cls_loss_reduced_ema: 0.1535, loc_loss_reduced_ema: 0.1874, dir_loss_reduced_ema: 0.0226, iou_pred_loss_ema: 0.0660, loc_loss_elem_ema: ['0.0054', '0.0029', '0.0182', '0.0127', '0.0205', '0.0151', '0.0190'], cls_pos_loss_ema: 0.0925, cls_neg_loss_ema: 0.0610, num_pos_ema: 75.9000, num_neg_ema: 70212.4000

Traceback (most recent call last):
  File "/home/miao/anaconda3/envs/sessd/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/miao/anaconda3/envs/sessd/lib/python3.7/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/home/miao/anaconda3/envs/sessd/lib/python3.7/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/home/miao/anaconda3/envs/sessd/lib/python3.7/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/home/miao/anaconda3/envs/sessd/lib/python3.7/multiprocessing/connection.py", line 913, in wait
    with _WaitSelector() as selector:
  File "/home/miao/anaconda3/envs/sessd/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 19435) is killed by signal: Killed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 118, in <module>
    main()
  File "train.py", line 115, in main
    train_detector(model, datasets, cfg, distributed=distributed, validate=args.validate, logger=logger,)
  File "/home/miao/Music/SE-SSD/det3d/torchie/apis/train_sessd.py", line 323, in train_detector
    trainer.run(data_loaders, cfg.workflow, cfg.total_epochs, local_rank=cfg.local_rank)
  File "/home/miao/Music/SE-SSD/det3d/torchie/trainer/trainer_sessd.py", line 472, in run
    epoch_runner(data_loaders[0], data_loaders[1], self.epoch, **kwargs)
  File "/home/miao/Music/SE-SSD/det3d/torchie/trainer/trainer_sessd.py", line 333, in train
    for i, data_batch in enumerate(data_loader):
  File "/home/miao/anaconda3/envs/sessd/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/miao/anaconda3/envs/sessd/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
    idx, data = self._get_data()
  File "/home/miao/anaconda3/envs/sessd/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1152, in _get_data
    success, data = self._try_get_data()
  File "/home/miao/anaconda3/envs/sessd/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1003, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 19435) exited unexpectedly
```

Does anybody know what happened? How can I train correctly?

I solved this problem by changing `workers_per_gpu` from 4 to 2.
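
For anyone hitting the same thing: a DataLoader worker dying with `killed by signal: Killed` (SIGKILL) almost always means the Linux out-of-memory killer terminated the process; `dmesg | grep -i oom` on the host will usually confirm it. Reducing `workers_per_gpu` helps because each worker process holds its own copy of the dataset and its prefetched batches. A minimal sketch of the change, assuming a Det3D-style config like SE-SSD's `examples/second/configs/config.py` (the exact path and the surrounding fields may differ in your checkout):

```python
# Sketch of the fix in a Det3D-style config; every field here other than
# workers_per_gpu is an assumption carried over from the stock config.
data = dict(
    samples_per_gpu=4,   # batch size per GPU, unchanged
    workers_per_gpu=2,   # was 4; fewer worker processes -> less host RAM
    # train / val / test dataset dicts stay exactly as in the original config
)
```

Det3D passes `workers_per_gpu` through to the PyTorch DataLoader's `num_workers`, so setting it to 0 runs loading in the main process instead; that is slower, but it removes the multiprocessing layer entirely, which is handy when debugging dataset code.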