christianpayer/MedicalDataAugmentationTool-VerSe

Running main_spine_localization.py gives loss_net = nan

HongfeiJia opened this issue · 6 comments

When training with main_spine_localization.py, loss_net becomes nan. I have checked all the parameters, but the result is always this.
Do you know why? Thank you.

09:37:40: train iter: 0 loss_net: 0.3521 norm: 4.5986 norm_average: 9.9460 seconds: 11.332
09:38:08: train iter: 100 loss_net: nan norm: nan norm_average: 9.9460 seconds: 28.198
09:38:37: train iter: 200 loss_net: nan norm: nan norm_average: 9.9460 seconds: 29.554
09:39:06: train iter: 300 loss_net: nan norm: nan norm_average: 9.9460 seconds: 28.922
09:39:40: train iter: 400 loss_net: nan norm: nan norm_average: 9.9460 seconds: 33.664
09:40:12: train iter: 500 loss_net: nan norm: nan norm_average: 9.9460 seconds: 32.590

In the VerSe 2020 code, self.image_size = [None, None, None]. Do I need to change this?

Hi, I noticed that you have successfully run the code, even though loss_net = nan. I'm wondering whether you came across the error "No module named 'utils.io'", even with the MedicalDataAugmentationTool framework downloaded and added to PYTHONPATH, and after running pip install utils. Did you hit the same error? How did you solve it? Your advice would really help! Thanks!

Can I ask you some questions? I got the problem below. Is this related to my data? I downloaded it from GitHub.
Output folder: ./output/spine_localization/unet/d0_25_fin/0/2023-04-28_00-39-34
2023-04-28 00:39:42.870604: W tensorflow/core/common_runtime/shape_refiner.cc:88] Function instantiation has undefined input shape at index: 9 in the outer inference context.
2023-04-28 00:39:42.870738: W tensorflow/core/common_runtime/shape_refiner.cc:88] Function instantiation has undefined input shape at index: 16 in the outer inference context.
2023-04-28 00:39:42.872210: W tensorflow/core/common_runtime/shape_refiner.cc:88] Function instantiation has undefined input shape at index: 80 in the outer inference context.
2023-04-28 00:39:42.872307: W tensorflow/core/common_runtime/shape_refiner.cc:88] Function instantiation has undefined input shape at index: 87 in the outer inference context.
Merge node with control input: StatefulPartitionedCall/unet/unet_avg_linear3d/expanding0/dropout_3/cond/branch_executed/_1267
2023-04-28 00:39:47.636940: W tensorflow/compiler/jit/xla_device.cc:398] XLA_GPU and XLA_CPU devices are deprecated and will be removed in subsequent releases. Instead, use either @tf.function(experimental_compile=True) for must-compile semantics, or run with TF_XLA_FLAGS=--tf_xla_auto_jit=2 for auto-clustering best-effort compilation.

You have a problem with your Python path; please follow the README. The utils.io module is part of the source code.
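One way to check which file Python actually resolves for utils.io is sketched below. It is only an illustration: the helper names locate_module and is_inside and the directory /path/to/MedicalDataAugmentationTool are placeholders, not part of the framework. If the reported path lies outside the framework checkout (for example in site-packages, where an unrelated package installed with pip install utils would live), or if nothing is found, the framework directory is probably not on PYTHONPATH.

```python
import importlib.util
import os


def locate_module(name):
    """Return the file that `import name` would load, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # Raised when the parent package is missing or is not a package.
        return None
    if spec is None:
        return None
    return spec.origin


def is_inside(path, directory):
    """Return True if `path` lies under `directory`."""
    path = os.path.realpath(path)
    directory = os.path.realpath(directory)
    try:
        return os.path.commonpath([path, directory]) == directory
    except ValueError:
        # For example, paths on different drives on Windows.
        return False


if __name__ == "__main__":
    # Placeholder: replace with the location of your own framework checkout.
    framework_dir = "/path/to/MedicalDataAugmentationTool"
    origin = locate_module("utils.io")
    if origin is None:
        print("utils.io was not found; check PYTHONPATH")
    elif is_inside(origin, framework_dir):
        print("utils.io comes from the framework:", origin)
    else:
        print("utils.io comes from somewhere else:", origin)
```

Run it with the same interpreter and environment variables that are used to start main_spine_localization.py, since PYTHONPATH is read when the interpreter starts.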

Try to use the Docker image to avoid conflicts, or, if you are running without Docker, use the exact package versions specified in the Docker setup.
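When running without Docker, the installed package versions can be compared against the expected ones with a short script. The sketch below assumes you copy the expected versions by hand from the repository's Docker files; the function name find_version_mismatches and the "X.Y.Z" entry are placeholders, not values taken from the project.

```python
from importlib import metadata


def find_version_mismatches(pinned):
    """Compare installed package versions with the expected ones.

    Returns {package: (expected, installed)} for every mismatch.
    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for package, expected in pinned.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    # Placeholder pins: fill in the versions listed in the Docker setup.
    pinned = {"tensorflow": "X.Y.Z"}
    for package, (expected, installed) in find_version_mismatches(pinned).items():
        print(f"{package}: expected {expected}, installed {installed}")
```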

Are you using the same dataset with the same preprocessing and parameters? Check this out: https://sisyphus.gitbook.io/project/deep-learning-basics/deep-learning-debug/common-causes-of-nans-during-training
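A first debugging step is to narrow down when the loss stops being finite. In the log above the loss is printed only every 100 iterations, so the nan appeared somewhere after iteration 0 and no later than iteration 100. The sketch below parses log lines in the format shown in this issue and reports the last iteration with a finite loss and the first one without. The function name first_nonfinite_iteration is made up for this illustration and is not part of the framework.

```python
import math
import re

# Matches lines such as "09:37:40: train iter: 0 loss_net: 0.3521 norm: ..."
LOG_PATTERN = re.compile(r"train iter:\s*(\d+)\s+loss_net:\s*(\S+)")


def first_nonfinite_iteration(log_text):
    """Return (last_finite_iter, first_bad_iter) found in the log text.

    Either value is None if no such iteration was logged.
    """
    last_finite = None
    for match in LOG_PATTERN.finditer(log_text):
        iteration = int(match.group(1))
        loss = float(match.group(2))  # float() also parses "nan" and "inf"
        if not math.isfinite(loss):
            return last_finite, iteration
        last_finite = iteration
    return last_finite, None


if __name__ == "__main__":
    log = (
        "09:37:40: train iter: 0 loss_net: 0.3521 norm: 4.5986\n"
        "09:38:08: train iter: 100 loss_net: nan norm: nan\n"
    )
    print(first_nonfinite_iteration(log))  # (0, 100)
```

If the framework allows it, logging more often and rerunning gives a tighter window, which makes it easier to inspect the batches fed to the network just before the loss diverges.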