yeerwen/UniSeg

RUN ERROR

Closed this issue · 3 comments

I downloaded the BTCV dataset, converted it to nnUNet format, and preprocessed it, but the following error occurs during training. What is the problem, and what should I do?

/opt/conda/conda-bld/pytorch_1699449200967/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [1225,0,0], thread: [11,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1699449200967/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [1225,0,0], thread: [12,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1699449200967/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [1225,0,0], thread: [13,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1699449200967/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [1225,0,0], thread: [14,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1699449200967/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [1225,0,0], thread: [15,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1699449200967/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [1225,0,0], thread: [16,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1699449200967/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [1225,0,0], thread: [17,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1699449200967/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [1225,0,0], thread: [18,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1699449200967/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [1225,0,0], thread: [19,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1699449200967/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [1225,0,0], thread: [20,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1699449200967/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [1225,0,0], thread: [21,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Traceback (most recent call last):
File "/home/user_gou/anaconda3/envs/uniseg/bin/nnUNet_train", line 12, in
sys.exit(main())
File "/home/user_gou/anaconda3/envs/uniseg/lib/python3.9/site-packages/nnunet/run/run_training.py", line 179, in main
trainer.run_training()
File "/home/user_gou/anaconda3/envs/uniseg/lib/python3.9/site-packages/nnunet/training/network_training/nnUNetTrainerV2.py", line 455, in run_training
ret = super().run_training()
File "/home/user_gou/anaconda3/envs/uniseg/lib/python3.9/site-packages/nnunet/training/network_training/nnUNetTrainer.py", line 318, in run_training
super(nnUNetTrainer, self).run_training()
File "/home/user_gou/anaconda3/envs/uniseg/lib/python3.9/site-packages/nnunet/training/network_training/network_trainer.py", line 456, in run_training
l = self.run_iteration(self.tr_gen, True)
File "/home/user_gou/anaconda3/envs/uniseg/lib/python3.9/site-packages/nnunet/training/network_training/UniSeg_Trainer_DS.py", line 151, in run_iteration
l = self.loss(output, target)
File "/home/user_gou/anaconda3/envs/uniseg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user_gou/anaconda3/envs/uniseg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user_gou/anaconda3/envs/uniseg/lib/python3.9/site-packages/nnunet/training/loss_functions/deep_supervision.py", line 39, in forward
l = weights[0] * self.loss(x[0], y[0])
File "/home/user_gou/anaconda3/envs/uniseg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user_gou/anaconda3/envs/uniseg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user_gou/anaconda3/envs/uniseg/lib/python3.9/site-packages/nnunet/training/loss_functions/dice_loss.py", line 346, in forward
dc_loss = self.dc(net_output, target, loss_mask=mask) if self.weight_dice != 0 else 0
File "/home/user_gou/anaconda3/envs/uniseg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user_gou/anaconda3/envs/uniseg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user_gou/anaconda3/envs/uniseg/lib/python3.9/site-packages/nnunet/training/loss_functions/dice_loss.py", line 178, in forward
tp, fp, fn, _ = get_tp_fp_fn_tn(x, y, axes, loss_mask, False)
File "/home/user_gou/anaconda3/envs/uniseg/lib/python3.9/site-packages/nnunet/training/loss_functions/dice_loss.py", line 130, in get_tp_fp_fn_tn
tp = net_output * y_onehot
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception in thread Thread-5:
Traceback (most recent call last):
File "/home/user_gou/anaconda3/envs/uniseg/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/home/user_gou/anaconda3/envs/uniseg/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/home/user_gou/anaconda3/envs/uniseg/lib/python3.9/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the print"
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/user_gou/anaconda3/envs/uniseg/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/home/user_gou/anaconda3/envs/uniseg/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/home/user_gou/anaconda3/envs/uniseg/lib/python3.9/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the print"
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Aborted (core dumped)

This error seems to occur when calculating the loss. Therefore, check the output size and the maximum and minimum values of the target.
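To illustrate why the target's value range matters: the "index out of bounds" assertion comes from a scatter/gather kernel, and the Dice loss most likely uses a scatter to one-hot encode the target just before the line shown in the traceback. The sketch below is a plain-Python analogue of that encoding, not the nnUNet code itself; the function name `one_hot` is hypothetical.

```python
def one_hot(labels, num_classes):
    """One-hot encode a flat list of integer labels (CPU analogue of a scatter)."""
    encoded = []
    for label in labels:
        # A scatter writes a 1 at position `label` along the class axis,
        # so every label must satisfy 0 <= label < num_classes.
        if not 0 <= label < num_classes:
            raise IndexError(
                f"label {label} is out of bounds for {num_classes} classes"
            )
        row = [0] * num_classes
        row[label] = 1
        encoded.append(row)
    return encoded
```

On the GPU the same out-of-range label does not raise a Python exception at the offending line; it trips a device-side assert that is reported asynchronously, which is why the traceback may point at a later statement.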

I believe your project itself is fine, so I suspect the problem is in my dataset. Here are my data processing steps. I skipped the upstream training and went directly to the downstream task. First, I downloaded the dataset from https://www.synapse.org/#!Synapse:syn3193805/wiki/217789. Next, I converted it with the project's own Convert_BTCV_to_nnUNet_dataset.py, which produced the Task060_BTCV folder, and I placed that folder in nnUNet_raw/nnUNet_raw_data. Then I ran nnUNet_plan_and_preprocess -t 60 --verify_dataset_integrity. Finally, I ran CUDA_VISIBLE_DEVICES=0 nnUNet_n_proc_DA=32 nnUNet_train 3d_fullres UniSeg_Trainer_DS 60 0, and the error from this issue appeared. Did I do something wrong, is the error caused by a mistake in my data processing, or am I missing a step?

Your data preprocessing steps are fine. Therefore, I would still recommend that you print the output size as well as the maximum and minimum values of the target to check first.
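A minimal sketch of that check, assuming the network output has shape (batch, channels, ...) and the target holds integer class labels. The helper name `check_target_range` is hypothetical and not part of UniSeg or nnUNet.

```python
def check_target_range(output_shape, target_min, target_max):
    """Return a list of problems; an empty list means the labels fit the output."""
    num_classes = output_shape[1]  # channel axis of a (B, C, ...) output
    problems = []
    if target_min < 0:
        problems.append(f"target min {target_min} is below 0")
    if target_max >= num_classes:
        problems.append(
            f"target max {target_max} >= number of output channels {num_classes}"
        )
    return problems


# Assumed placement: just before `l = self.loss(output, target)` in
# run_iteration, where output and target are per-resolution lists of tensors:
#   print(check_target_range(tuple(output[0].shape),
#                            target[0].min().item(),
#                            target[0].max().item()))
```

If the list is not empty, the label values in the converted dataset do not match the number of classes the network was built for. Rerunning with CUDA_LAUNCH_BLOCKING=1, as the error message suggests, should also make the traceback point at the failing call.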