bowang-lab/U-Mamba

RuntimeError: One or more background workers are no longer alive.

Opened this issue · 4 comments

Hi all, when I start training in a Windows environment, I get the error below. Even though I have tried the solution from MIC-DKFZ/nnUNet#1343 in the original nnUNet repository, setting the environment variable OMP_NUM_THREADS=1, the problem is still not solved.
Thank you in advance for your help!
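
For reference, this is roughly equivalent to what I tried (a sketch only: setting the variable from Python before importing nnU-Net, with the dataset id, configuration and fold taken from this run):

```python
# Sketch: set OMP_NUM_THREADS before nnU-Net / PyTorch are imported
# (as suggested in MIC-DKFZ/nnUNet#1343) and launch training from Python.
# The dataset id, configuration and fold below are just the ones from this run.
import os
os.environ["OMP_NUM_THREADS"] = "1"

from nnunetv2.run.run_training import run_training

if __name__ == "__main__":  # guard is required on Windows because workers are spawned
    run_training("701", "2d", 0)
```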

```
This is the configuration used by this training:
Configuration name: 2d
{'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 14, 'patch_size': [512, 448], 'median_image_size_in_voxels': [512.0, 512.0], 'spacing': [0.7958984971046448, 0.7958984971046448], 'normalization_schemes': ['CTNormalization'], 'use_mask_for_norm': [False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 1, 1, 1], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 1, 1], 'num_pool_per_axis': [6, 6], 'pool_op_kernel_sizes': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]], 'conv_kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'unet_max_num_features': 512, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': True}

These are the global plan.json settings:
{'dataset_name': 'Dataset701_AbdomenCT', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [2.5, 0.7958984971046448, 0.7958984971046448], 'original_median_shape_after_transp': [97, 512, 512], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 3071.0, 'mean': 97.29691314697266, 'median': 118.0, 'min': -1024.0, 'percentile_00_5': -958.0, 'percentile_99_5': 270.0, 'std': 137.85003662109375}}}

2024-07-24 17:20:43.049483: unpacking dataset...
2024-07-24 17:20:43.598747: unpacking done...
2024-07-24 17:20:43.599747: do_dummy_2d_data_aug: False
2024-07-24 17:20:43.666747: Unable to plot network architecture:
2024-07-24 17:20:43.666747: No module named 'hiddenlayer'
2024-07-24 17:20:43.759725:
2024-07-24 17:20:43.760716: Epoch 0
2024-07-24 17:20:43.761715: Current learning rate: 0.01
using pin_memory on device 0
Traceback (most recent call last):
File "\?\C:\ProgramData\Anaconda3\envs\umamba\Scripts\nnUNetv2_train-script.py", line 33, in
sys.exit(load_entry_point('nnunetv2', 'console_scripts', 'nnUNetv2_train')())
File "f:\u-mamba-main\umamba\nnunetv2\run\run_training.py", line 268, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "f:\u-mamba-main\umamba\nnunetv2\run\run_training.py", line 204, in run_training
nnunet_trainer.run_training()
File "f:\u-mamba-main\umamba\nnunetv2\training\nnUNetTrainer\nnUNetTrainer.py", line 1258, in run_training
train_outputs.append(self.train_step(next(self.dataloader_train)))
File "f:\u-mamba-main\umamba\nnunetv2\training\nnUNetTrainer\nnUNetTrainer.py", line 900, in train_step
output = self.network(data)
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "f:\u-mamba-main\umamba\nnunetv2\nets\UMambaBot_2d.py", line 432, in forward
skips[-1] = self.mamba_layer(skips[-1])
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\amp\autocast_mode.py", line 16, in decorate_autocast
return func(*args, **kwargs)
File "f:\u-mamba-main\umamba\nnunetv2\nets\UMambaBot_2d.py", line 61, in forward
x_mamba = self.mamba(x_norm)
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\mamba_ssm\modules\mamba_simple.py", line 146, in forward
out = mamba_inner_fn(
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\mamba_ssm\ops\selective_scan_interface.py", line 317, in mamba_inner_fn
return MambaInnerFn.apply(xz, conv1d_weight, conv1d_bias, x_proj_weight, delta_proj_weight,
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\autograd\function.py", line 539, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\cuda\amp\autocast_mode.py", line 113, in decorate_fwd
return fwd(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\mamba_ssm\ops\selective_scan_interface.py", line 187, in forward
conv1d_out = causal_conv1d_cuda.causal_conv1d_fwd(
TypeError: causal_conv1d_fwd(): incompatible function arguments. The following argument types are supported:
1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: Optional[torch.Tensor], arg3: Optional[torch.Tensor], arg4: bool) -> torch.Tensor

Invoked with: tensor([[[-0.3531, -0.3256, -0.5120, ..., -0.3845, -0.3780, -0.2731],
[-0.1226, 0.0515, 0.0443, ..., -0.0484, -0.0954, 0.2243],
[ 0.2591, 0.4765, 0.4899, ..., 0.2762, 0.2085, 0.1601],
...,
[-0.4706, 0.0122, -0.0670, ..., -0.6855, -1.0694, -0.7547],
[ 0.2710, 0.6020, 0.5813, ..., 0.0339, 0.0822, 0.5069],
[-0.0817, 0.1549, 0.1879, ..., -0.1216, -0.4358, -0.3873]],

    [[-0.7350, -0.6563, -0.6970,  ..., -0.5548, -0.2491, -0.3194],
     [-0.3465, -0.6268, -0.4854,  ...,  0.2556,  0.1076,  0.1940],
     [ 0.0645,  0.5889,  0.7408,  ...,  0.4412,  0.1118,  0.2022],
     ...,
     [-0.7669, -0.8219, -0.9606,  ..., -0.6517, -0.6021, -0.7447],
     [ 0.6877,  0.3808,  0.4204,  ...,  0.2805,  0.3491,  0.3867],
     [ 0.1577,  0.0902,  0.0191,  ..., -0.5127, -0.3992, -0.4217]],

    [[-0.6899, -0.6800, -0.7939,  ..., -0.2452, -0.2823, -0.2156],
     [-0.2452, -0.2569, -0.4180,  ...,  0.2565,  0.3105,  0.2020],
     [ 0.4328,  0.6825,  0.6242,  ...,  0.2382,  0.2548,  0.2945],
     ...,
     [-0.5348, -0.4934, -0.6218,  ..., -0.8466, -0.8843, -0.9299],
     [ 0.1885,  0.4097,  0.3503,  ...,  0.5430,  0.5202,  0.5581],
     [-0.4576, -0.3852, -0.5572,  ..., -0.4343, -0.5026, -0.4852]],

    ...,

    [[-0.3982, -0.6243, -0.6702,  ..., -0.2997, -0.0544, -0.6496],
     [-0.3635, -0.3576, -0.4177,  ...,  0.1261,  0.1114,  0.0181],
     [ 0.3839,  0.7153,  0.7155,  ...,  0.2303,  0.1457, -0.1998],
     ...,
     [-0.6408, -0.5035, -0.6167,  ..., -0.6473, -0.4699, -0.2966],
     [ 0.3132,  0.4346,  0.4209,  ...,  0.0756,  0.2835,  0.2599],
     [-0.2990, -0.3384, -0.4100,  ...,  0.0843, -0.1040, -0.0645]],

    [[-0.4619, -0.7534, -0.7760,  ..., -0.5952, -0.3705, -0.3551],
     [-0.1528, -0.3495, -0.3650,  ...,  0.0889,  0.2627,  0.0885],
     [ 0.5250,  0.7301,  0.7312,  ...,  0.2815,  0.2979,  0.2394],
     ...,
     [-0.6124, -0.5625, -0.6515,  ..., -0.4177, -0.9805, -0.9586],
     [ 0.3327,  0.3848,  0.4037,  ...,  0.0295,  0.4747,  0.5617],
     [-0.3875, -0.3905, -0.4910,  ..., -0.0437, -0.5517, -0.5322]],

    [[-0.5744, -0.5597, -0.6744,  ..., -0.4591, -0.5266, -0.3234],
     [-0.2457, -0.3103, -0.3841,  ...,  0.0146,  0.0279,  0.0058],
     [ 0.5145,  0.6709,  0.6334,  ...,  0.0854,  0.1010,  0.3496],
     ...,
     [-0.6111, -0.6036, -0.6492,  ..., -0.6807, -0.6825, -0.8804],
     [ 0.2965,  0.4934,  0.4702,  ...,  0.5427,  0.5108,  0.7819],
     [-0.3857, -0.3858, -0.3655,  ..., -0.4994, -0.5220, -0.0722]]],
   device='cuda:0', requires_grad=True), tensor([[ 0.2771, -0.4502,  0.2234,  0.4393],
    [-0.2371,  0.0904,  0.3013,  0.2585],
    [-0.2705,  0.0695,  0.4170, -0.1234],
    ...,
    [ 0.3458, -0.2377, -0.4476,  0.1447],
    [ 0.4869,  0.3001, -0.4930,  0.0575],
    [ 0.4755, -0.2672,  0.3849, -0.0855]], device='cuda:0',
   requires_grad=True), Parameter containing:

tensor([-0.0066, -0.3897, 0.1920, ..., 0.1256, -0.0983, -0.4903],
device='cuda:0', requires_grad=True), None, None, None, True
Exception in thread Thread-4 (results_loop):
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\umamba\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
File "C:\ProgramData\Anaconda3\envs\umamba\lib\threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
```
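
From the TypeError above, causal_conv1d_fwd() is invoked with 7 arguments while the compiled binding only accepts 5, which as far as I can tell usually means the installed mamba-ssm and causal-conv1d versions do not match. A minimal, nnU-Net-independent sketch to check this (just my assumption; tensor shapes are arbitrary):

```python
# Standalone check: does a bare Mamba layer run with the currently installed
# mamba-ssm / causal-conv1d builds? If this raises the same TypeError as the
# training run above, the two packages are mismatched and should be reinstalled
# at compatible versions (the U-Mamba README lists the ones it was tested with).
import torch
from mamba_ssm import Mamba

layer = Mamba(d_model=64).cuda()
x = torch.randn(2, 128, 64, device="cuda")  # (batch, sequence length, d_model)

# U-Mamba calls the layer inside an autocast region, so do the same here
with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = layer(x)

print("forward pass OK, output shape:", tuple(y.shape))
```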

I'm running into the same problem. Have you solved it yet?

I'm running into the same problem as well. Have you solved it yet? This is the command I ran: `CUDA_VISIBLE_DEVICES=1 nnUNetv2_train 11 3d_fullres 0`, and the output follows:

```
############################
INFO: You are using the old nnU-Net default plans. We have updated our recommendations. Please consider using those instead! Read more here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md
############################

Using device: cuda:0

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

2024-07-28 01:12:55.969214: do_dummy_2d_data_aug: True
2024-07-28 01:12:55.970464: Using splits from existing split file: /gpfs/share/home/2301210659/tools/nnunet_v2/dataset/nnUNet_preprocessed/Dataset011_T-tubule/splits_final.json
2024-07-28 01:12:55.971022: The split file contains 5 splits.
2024-07-28 01:12:55.971232: Desired fold for training: 0
2024-07-28 01:12:55.971411: This split has 4 training and 1 validation cases.
using pin_memory on device 0
Exception in background worker 3:
local variable 'region_labels' referenced before assignment
Traceback (most recent call last):
File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 53, in producer
item = next(data_loader)
File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgenerators/dataloading/data_loader.py", line 126, in next
return self.generate_train_batch()
File "/gpfs/share/home/2301210659/tools/nnunet_v2/nnUNet/nnunetv2/training/dataloading/data_loader_3d.py", line 61, in generate_train_batch
tmp = self.transforms(**{'image': data_all[b], 'segmentation': seg_all[b]})
File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgeneratorsv2/transforms/base/basic_transform.py", line 18, in call
return self.apply(data_dict, **params)
File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgeneratorsv2/transforms/utils/compose.py", line 13, in apply
data_dict = t(**data_dict)
File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgeneratorsv2/transforms/base/basic_transform.py", line 18, in call
return self.apply(data_dict, **params)
File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgeneratorsv2/transforms/base/basic_transform.py", line 67, in apply
data_dict['segmentation'] = self._apply_to_segmentation(data_dict['segmentation'], **params)
File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgeneratorsv2/transforms/utils/seg_to_regions.py", line 17, in _apply_to_segmentation
if isinstance(region_labels, int) or len(region_labels) == 1:
UnboundLocalError: local variable 'region_labels' referenced before assignment
Traceback (most recent call last):
File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/bin/nnUNetv2_train", line 8, in
sys.exit(run_training_entry())
File "/gpfs/share/home/2301210659/tools/nnunet_v2/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/gpfs/share/home/2301210659/tools/nnunet_v2/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
nnunet_trainer.run_training()
File "/gpfs/share/home/2301210659/tools/nnunet_v2/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1362, in run_training
self.on_train_start()
File "/gpfs/share/home/2301210659/tools/nnunet_v2/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 903, in on_train_start
self.dataloader_train, self.dataloader_val = self.get_dataloaders()
File "/gpfs/share/home/2301210659/tools/nnunet_v2/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 696, in get_dataloaders
_ = next(mt_gen_train)
File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in next
item = self.__get_next_item()
File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
```
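
One way to see the worker's error with a complete, unbroken traceback is to disable the multi-process augmenter before starting training. A sketch, under the assumption that this nnU-Net v2 build reads the nnUNet_n_proc_DA environment variable (0 should fall back to a single-threaded data loader):

```python
# Sketch: force single-threaded data augmentation so exceptions like the
# UnboundLocalError above are raised in the main process.
# nnUNet_n_proc_DA=0 is an assumption based on nnU-Net v2's get_allowed_n_proc_DA().
import os
import subprocess

env = dict(os.environ, nnUNet_n_proc_DA="0", CUDA_VISIBLE_DEVICES="1")

# re-run the same command as above, now with single-threaded data loading
subprocess.run(["nnUNetv2_train", "11", "3d_fullres", "0"], env=env, check=True)
```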

I have solved my problem. I think many of these issues ultimately end in "One or more background workers are no longer alive...", so you should trace back to the traceback printed above that message for the actual error. In my case the fix was to reinstall the required packages. You could also try Python 3.10, since I saw that the authors recommend 3.10.
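
If it helps, a small sketch for recording the interpreter and package versions before and after reinstalling (the distribution names are just the packages that show up in the tracebacks in this thread):

```python
# Print the Python version and the versions of the packages involved in this
# thread; useful to compare before/after reinstalling. Distribution names are
# assumptions based on the tracebacks above.
import sys
from importlib.metadata import version, PackageNotFoundError

print("python:", sys.version)
for dist in ("torch", "mamba-ssm", "causal-conv1d",
             "batchgenerators", "batchgeneratorsv2", "nnunetv2"):
    try:
        print(f"{dist}: {version(dist)}")
    except PackageNotFoundError:
        print(f"{dist}: not installed")
```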

Hi everyone, maybe this can help you. #56 (comment)