zwx8981/LIQE

training process becomes unresponsive halfway through

Closed this issue · 5 comments

Has anyone encountered a similar issue where the training process becomes unresponsive halfway through, with no additional log entries being recorded, yet no error is thrown?

like:
1.7274575140626334e-06
(E:33, S:1 / 200) [Loss = 0.1450] (16.6 samples/sec; 0.964 sec/batch)
(E:33, S:2 / 200) [Loss = 0.1431] (17.4 samples/sec; 0.918 sec/batch)
(E:33, S:3 / 200) [Loss = 0.1425] (18.2 samples/sec; 0.878 sec/batch)
(E:33, S:4 / 200) [Loss = 0.1426] (19.0 samples/sec; 0.843 sec/batch)
(E:33, S:5 / 200) [Loss = 0.1429] (19.7 samples/sec; 0.811 sec/batch)
(E:33, S:6 / 200) [Loss = 0.1420] (19.3 samples/sec; 0.831 sec/batch)

<logging just stops here, don't know why..>
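In case it helps with debugging, one way to find out where the process is stuck when the logs go silent like this is the standard-library faulthandler (a minimal sketch; the 600-second interval is just an example value):

import faulthandler
import sys

# Periodically dump the traceback of every thread to stderr, so a silent
# hang shows which call the process is blocked in.
faulthandler.dump_traceback_later(600, repeat=True, file=sys.stderr)

Attaching py-spy to the running process (py-spy dump --pid <pid>) gives a similar stack snapshot without modifying the script.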

This is a bit strange. Does this problem occur randomly or at a specific epoch (like E:33 as you show)?

It is random. When I interrupted the program, it had reached this point:

(E:34, S:1 / 200) [Loss = 0.1484] (4.4 samples/sec; 0.916 sec/batch)
(E:34, S:2 / 200) [Loss = 0.1454] (4.6 samples/sec; 0.870 sec/batch)
^CTraceback (most recent call last):
  File "/home/user5/code/IAA/LIQE/train_unique_clip_weight.py", line 739, in <module>
    best_result, best_epoch, srcc_dict, scene_dict, type_dict, all_result = train(model, best_result, best_epoch, srcc_dict,
  File "/home/user5/code/IAA/LIQE/train_unique_clip_weight.py", line 195, in train
    sample_batched = next(loader)
  File "/home/user5/anaconda3/envs/testenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/user5/anaconda3/envs/testenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
    idx, data = self._get_data()
  File "/home/user5/anaconda3/envs/testenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1142, in _get_data
    success, data = self._try_get_data()
  File "/home/user5/anaconda3/envs/testenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/user5/anaconda3/envs/testenv/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/user5/anaconda3/envs/testenv/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
KeyboardInterrupt

Process finished with exit code 130

It appears that the data loading thread is not receiving data and is stuck waiting.
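If it is the usual multi-process DataLoader deadlock, two common workarounds are to load data in the main process or to set a timeout so the loader raises instead of hanging silently. A minimal sketch (train_dataset and the batch size are placeholders, not the names used in this repo):

from torch.utils.data import DataLoader

# Option 1: no worker processes; slower, but rules out worker deadlocks.
loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=0)

# Option 2: keep workers but fail loudly: raises RuntimeError if no batch
# arrives within 300 seconds (example value) instead of waiting forever.
loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
                    num_workers=4, timeout=300)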

Hello, I would like to ask how you solved this problem?

@JennyVanessa Hi, you may try this. #17 (comment)