MIC-DKFZ/MultiTalent

No such file or directory '/share/sda/mohammadqazi/research/MultiTalent/nnUNet_preprocessed/Task003_Liver/splits_final.pkl'


/home/mohammadqazi/.conda/envs/multitalent/lib/python3.7/site-packages/torch/distributed/launch.py:188: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

FutureWarning,

Please cite the following paper when using nnUNet:

Isensee, F., Jaeger, P.F., Kohl, S.A.A. et al. "nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation." Nat Methods (2020). https://doi.org/10.1038/s41592-020-01008-z

If you have questions or suggestions, feel free to open an issue at https://github.com/MIC-DKFZ/nnUNet

###############################################
I am running the following nnUNet: 3d_fullres
My trainer class is: <class 'nnunet.training.network_training.custom_trainers.MultiTalent.MultiTalent.MultiTalent_Trainer_DDP.MultiTalent_trainer_ddp'>
For that I will be using the following configuration:
num_classes: 24
modalities: {0: 'CT'}
use_mask_for_norm OrderedDict([(0, False)])
keep_only_largest_region None
min_region_size_per_class None
min_size_per_class None
normalization_schemes OrderedDict([(0, 'CT')])
stages...

stage: 0
{'batch_size': 4, 'num_pool_per_axis': [4, 5, 5], 'patch_size': array([ 96, 224, 192]), 'median_patient_size_in_voxels': array([190, 404, 404]), 'current_spacing': array([1.5, 1. , 1. ]), 'original_spacing': array([1.5, 1. , 1. ]), 'do_dummy_2D_data_aug': False, 'pool_op_kernel_sizes': [[2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [1, 2, 2]], 'conv_kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]]}

I am using stage 0 from these plans
I am using sample dice + CE loss

I am using data from this folder: /share/sda/mohammadqazi/research/MultiTalent/nnUNet_preprocessed/Task100_MultiTalent/MultiTalent_data
###############################################
local rank 0
worker 0 oversample 0.33000000000000007
worker 0 batch_size 4
loading dataset
Traceback (most recent call last):
  File "./nnunet/run/run_training_DDP.py", line 222, in <module>
    main()
  File "./nnunet/run/run_training_DDP.py", line 181, in main
    trainer.initialize(not validation_only)
  File "/home/mohammadqazi/.conda/envs/multitalent/lib/python3.7/site-packages/nnunet/training/network_training/custom_trainers/MultiTalent/MultiTalent/MultiTalent_Trainer_DDP.py", line 73, in initialize
    self.dl_tr, self.dl_val = self.get_basic_generators()
  File "/home/mohammadqazi/.conda/envs/multitalent/lib/python3.7/site-packages/nnunet/training/network_training/custom_trainers/MultiTalent/MultiTalent/MultiTalent_Trainer_DDP.py", line 630, in get_basic_generators
    self.do_split()
  File "/home/mohammadqazi/.conda/envs/multitalent/lib/python3.7/site-packages/nnunet/training/network_training/custom_trainers/MultiTalent/MultiTalent/MultiTalent_Trainer_DDP.py", line 451, in do_split
    splits_t = load_pickle(expected_splits_file)
  File "/home/mohammadqazi/.conda/envs/multitalent/lib/python3.7/site-packages/batchgenerators/utilities/file_and_folder_operations.py", line 57, in load_pickle
    with open(file, mode) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/share/sda/mohammadqazi/research/MultiTalent/nnUNet_preprocessed/Task003_Liver/splits_final.pkl'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 99639) of binary: /home/mohammadqazi/.conda/envs/multitalent/bin/python
Traceback (most recent call last):
  File "/home/mohammadqazi/.conda/envs/multitalent/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/mohammadqazi/.conda/envs/multitalent/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/mohammadqazi/.conda/envs/multitalent/lib/python3.7/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/mohammadqazi/.conda/envs/multitalent/lib/python3.7/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/mohammadqazi/.conda/envs/multitalent/lib/python3.7/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/mohammadqazi/.conda/envs/multitalent/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
    )(*cmd_args)
  File "/home/mohammadqazi/.conda/envs/multitalent/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/mohammadqazi/.conda/envs/multitalent/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./nnunet/run/run_training_DDP.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-10-24_10:31:46
host : BioMedIA-A5000
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 99639)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Under nnUNet_preprocessed I have only one dataset, i.e. /share/sda/mohammadqazi/research/MultiTalent/nnUNet_preprocessed/Task100_MultiTalent. Why is it searching for Task003_Liver? I am following the instructions for training a MultiTalent model, and the data preprocessing was also done following those same instructions. Please look into this.

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --master_port=1234 --nproc_per_node=1 ./nnunet/run/run_training_DDP.py 3d_fullres MultiTalent_trainer_ddp 100 0 -p MultiTalent_bs4 --dbs

This is the command I am running.

Thank you in advance.

Hi,
we wanted to evaluate MultiTalent on the same splits that nnUNet would use for the individual datasets. You can simply copy the splits_custom.pkl file into the corresponding MultiTalent preprocessed folder.
Alternatively, you would need to preprocess all datasets individually and start a default nnUNet training for each dataset (I would not recommend that).

But after following the training instructions, the preprocessing folder only contains Task100_MultiTalent. Should I run "nnUNet_plan_and_preprocess -t 100 -pl3d ExperimentPlanner3D_v21_MultiTalent -pl2d None -tf 16 --verify_dataset_integrity -overwrite_plans_identifier multitalent_bs4" for every dataset to get a folder in the preprocessing directory, and then copy splits_custom.pkl into each of them?

And the do_split() function in MultiTalent_Trainer_DDP takes the task_id from each dataset key, which returns 3 since 003 is the first image
(for task_id in np.unique([int(i.split("_")[0]) for i in keys]):).

Hence it looks for the splits file of the third task (Task003_Liver), which is not present in the preprocessing folder.
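For illustration, here is a minimal sketch of that lookup, pieced together from the snippet above and the traceback (the case keys, the placeholder path and the task-id-to-folder mapping are invented for the example; this is not the actual trainer code):

```python
import numpy as np
from os.path import join

# Hypothetical MultiTalent case identifiers: each key is prefixed with the
# numeric id of the dataset the case came from.
keys = ["003_case0001", "003_case0002", "009_case0001"]

preprocessed_root = "/path/to/nnUNet_preprocessed"        # placeholder
task_folders = {3: "Task003_Liver", 9: "Task009_Spleen"}  # illustrative mapping

# Same idea as the loop quoted above: collect the unique numeric task ids ...
for task_id in np.unique([int(k.split("_")[0]) for k in keys]):
    # ... and for each of them expect that dataset's own split file, which is
    # exactly the file the traceback reports as missing for Task003_Liver.
    expected_splits_file = join(preprocessed_root, task_folders[task_id], "splits_final.pkl")
    print(task_id, expected_splits_file)
```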

> But after following the training instructions, the preprocessing folder only contains Task100_MultiTalent. Should I run "nnUNet_plan_and_preprocess -t 100 -pl3d ExperimentPlanner3D_v21_MultiTalent -pl2d None -tf 16 --verify_dataset_integrity -overwrite_plans_identifier multitalent_bs4" for every dataset to get a folder in the preprocessing directory, and then copy splits_custom.pkl into each of them?

No, just copy splits_custom.pkl into the Task100_MultiTalent preprocessing folder. That should work.
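For example, assuming the downloaded splits_custom.pkl sits in the current working directory, the copy could look like this (the destination path is taken from the training log above; a plain cp works just as well):

```python
import shutil

# Destination: the MultiTalent preprocessed folder from the log above.
dst = "/share/sda/mohammadqazi/research/MultiTalent/nnUNet_preprocessed/Task100_MultiTalent"

# Copy the provided splits_custom.pkl into that folder.
shutil.copy("splits_custom.pkl", dst)
```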

> Hence it looks for the splits file of the third task (Task003_Liver), which is not present in the preprocessing folder.

As I said, I wanted to have the same splits as nnUNet for the individual datasets. Therefore, I first created a split file for each dataset with the default nnUNet preprocessing and training. Based on these individual split files, I created the splits_custom.pkl file for MultiTalent. But you can simply copy my splits_custom.pkl file; that way you save the preprocessing of the individual datasets.
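For context, here is a rough sketch of how such a combined split file could be assembled, assuming each per-dataset splits_final.pkl is the usual nnUNet list of five folds with "train"/"val" case lists and that MultiTalent keys are the case names prefixed with the numeric task id (both the key format and the paths are assumptions for illustration, not the authors' actual script):

```python
import pickle
from os.path import join

preprocessed_root = "/path/to/nnUNet_preprocessed"  # placeholder
tasks = {3: "Task003_Liver"}                        # extend with the other MultiTalent datasets

# One combined fold dict per nnUNet fold (assumed: 5 folds).
combined = [{"train": [], "val": []} for _ in range(5)]

for task_id, folder in tasks.items():
    with open(join(preprocessed_root, folder, "splits_final.pkl"), "rb") as f:
        splits = pickle.load(f)  # assumed: list of dicts with "train"/"val" case names
    for fold_idx, fold in enumerate(splits):
        # Prefix every case name with its zero-padded task id, matching the
        # "003_..." style keys seen in the do_split() snippet above.
        combined[fold_idx]["train"] += [f"{task_id:03d}_{c}" for c in fold["train"]]
        combined[fold_idx]["val"] += [f"{task_id:03d}_{c}" for c in fold["val"]]

with open(join(preprocessed_root, "Task100_MultiTalent", "splits_custom.pkl"), "wb") as f:
    pickle.dump(combined, f)
```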

Thank you. It is working now.