[Question] Threading error after the last training epoch
Closed this issue · 8 comments
❓ Question
Hi,
I have a question: in the last stage of training, I got an error when using batchgenerators. I noticed that someone has mentioned this issue before. Is there a solution now?
Exception in thread Thread-3:
Traceback (most recent call last):
File "/home/liuyvjie/opt/miniforge3/envs/nndet_venv/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/home/liuyvjie/opt/miniforge3/envs/nndet_venv/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/home/liuyvjie/opt/miniforge3/envs/nndet_venv/lib/python3.9/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the print"
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/liuyvjie/opt/miniforge3/envs/nndet_venv/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/home/liuyvjie/opt/miniforge3/envs/nndet_venv/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/home/liuyvjie/opt/miniforge3/envs/nndet_venv/lib/python3.9/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the print"
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
What is the purpose of the batchgenerators package, and will this error affect my training process and the saving of the result output?
Dear @GarryJAY502 ,
batchgenerators is used for augmentation and data loading in nnDetection and is therefore essential for proper functionality.
As the message already indicates: "RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message". The passage you posted does not contain the actual error; please provide the full error message.
Best,
Michael
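For background on why this RuntimeError carries no detail: the data-loading workers are separate processes, so the real traceback is printed by the crashing worker itself, and the consumer thread only notices afterwards that the process is gone. The following is an editorial sketch with plain multiprocessing, not batchgenerators' actual code; only the spirit of the liveness check in `MultiThreadedAugmenter.results_loop` is mirrored here, and the helper names are made up.

```python
import multiprocessing as mp

def worker():
    # Simulates a data-loading worker that crashes: the real traceback
    # is printed here, in the worker process, not in the main process.
    raise ValueError("the actual error message appears in the worker's output")

def check_workers(procs):
    # Consumer-side liveness check, similar in spirit to the one in
    # batchgenerators' MultiThreadedAugmenter.results_loop.
    if not all(p.is_alive() or p.exitcode == 0 for p in procs):
        raise RuntimeError(
            "One or more background workers are no longer alive. Exiting. "
            "Please check the print statements above for the actual error message"
        )

if __name__ == "__main__":
    p = mp.Process(target=worker)
    p.start()
    p.join()  # worker dies with a nonzero exit code after printing its traceback
    try:
        check_workers([p])
    except RuntimeError as e:
        print("main process only sees:", e)
```

This is why the maintainers ask for the full log: the informative part is in the worker's own output, above the generic RuntimeError.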
The error message only contains the part shown in the figure, without any further content. This happened after the last epoch of training. What tasks will nnDetection perform after this?
Thanks, Michael
Dear @GarryJAY502 ,
that is indeed curious and may be a problem within batchgenerators, which might not shut down its workers correctly in combination with PyTorch Lightning.
nnDetection does not use batchgenerators after training anymore. After training, the empirical parameters need to be determined and whole-patient inference is performed to produce the final validation results.
Best,
Michael
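A common mitigation pattern for this kind of shutdown race is to tear the worker processes down explicitly in a `finally` block rather than relying on interpreter exit. The sketch below uses plain multiprocessing and is not nnDetection's or batchgenerators' actual code; the class and method names are hypothetical (batchgenerators' real loader class is `MultiThreadedAugmenter`).

```python
import multiprocessing as mp
import queue

def _producer(out_q, stop_evt):
    # Stand-in for an augmentation worker; loops until told to stop.
    while not stop_evt.is_set():
        try:
            out_q.put("batch", timeout=0.1)
        except queue.Full:
            pass  # queue full; re-check the stop flag and retry

class ToyLoader:
    """Illustrative multi-process loader with an explicit shutdown."""

    def __init__(self, num_workers=2):
        self._stop = mp.Event()
        self._queue = mp.Queue(maxsize=4)
        self._procs = [
            mp.Process(target=_producer, args=(self._queue, self._stop))
            for _ in range(num_workers)
        ]
        for p in self._procs:
            p.start()

    def shutdown(self):
        # Signal the workers, then join them, so no process is left
        # half-alive when the main process exits -- the situation the
        # RuntimeError above complains about.
        self._stop.set()
        for p in self._procs:
            p.join(timeout=5)
            if p.is_alive():
                p.terminate()  # last resort for a stuck worker

if __name__ == "__main__":
    loader = ToyLoader()
    try:
        batch = loader._queue.get(timeout=5)  # training loop would consume here
    finally:
        loader.shutdown()  # runs even if training raises
```

The `try`/`finally` guarantees the teardown runs even when training aborts with an exception, which is typically when the "workers no longer alive" message shows up.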
Thanks, Michael
Does this mean I can ignore this error and continue with the next task? Can the folder model/Task100_LymphNodes/RetinaUNetV001-D3V001_3d still be used, e.g. to run nndet_consolidate or nndet_predict?
If the training runs through completely, you can continue. The screenshot you posted only shows epoch 1, which is definitely not sufficient; the full schedule contains 60 epochs.
Thank you very much. I just set epoch=1 to reproduce this error and try to debug it.
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.