Dice is always zero during training

Question

JeyzerMC opened this issue 6 years ago · 7 comments

Hello,
Good job on the project, really impressive!

When I run train.py with unet2D_bn_modified_dice as the experiment configuration, training runs without errors but the dice is always zero. I ran a total of 27 epochs over a few hours on my machine before stopping the training process. The loss went down to about 0.75, but the dice stayed at 0 throughout, which prevents me from running evaluate_patients.py and test_loop.py.

Training produces model_best_xent.ckpt-XXXX checkpoints for cross-entropy, but no best-dice checkpoint.

Do you know what might cause the dice to not be updated? Thank you.

--
ACDC Dataset
Tensorflow version: 1.9.0
CUDA: 9.0
Python: 3.6.5 with Anaconda
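
When a reported dice sits at exactly zero, one thing worth ruling out is a network that predicts only background. The sketch below is not the repository's metric code: `dice_per_class` is a hypothetical helper, written in plain Python on flat label lists, and it assumes the four ACDC label values (0 for background, 1 to 3 for the cardiac structures). It shows that an all-background prediction yields a foreground dice of exactly 0 for every structure, whatever the background score is.

```python
def dice_per_class(pred, target, num_classes):
    """Return the Dice score of each class for two flat label sequences.

    Hypothetical helper for illustration; not taken from the repository.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    scores = []
    for c in range(num_classes):
        pred_c = sum(1 for p in pred if p == c)
        target_c = sum(1 for t in target if t == c)
        overlap = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        denom = pred_c + target_c
        # A class absent from both prediction and ground truth has no
        # defined Dice score, so report None instead of dividing by zero.
        scores.append(2.0 * overlap / denom if denom else None)
    return scores


# Ground truth containing foreground classes 1-3 (assumed ACDC labelling).
target = [0, 0, 1, 1, 2, 3, 0, 0]
# A degenerate prediction that labels every pixel as background.
all_background = [0] * len(target)

scores = dice_per_class(all_background, target, num_classes=4)
print("per-class dice:", scores)
```

If the argmax of the network output on a few validation slices looks like `all_background`, the zero is a symptom of the model not learning rather than of the metric bookkeeping.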

Answer 1 · 2018-08-16T08:51:37.000Z

Huh, weird. Let me look into that.

Answer 2 · 2018-08-20T16:02:59.000Z

I'm getting the same issue: DICE is always 0. Similar env as OP.

Answer 3 · 2018-08-22T08:19:23.000Z

Hi both,

If the model trains correctly, the dice loss should go down to ~0.13, so it appears that for some reason the models fail to start training in your case. However, the code runs fine in my setup with Tensorflow 1.8, CUDA 9.0, Python 3.4.3. Unfortunately, our local GPU infrastructure does not allow for TF1.9 yet, so I cannot test your problem.

I nevertheless made some changes to the code: I changed the batch norm code to tensorflow's own implementation, removed an unnecessary name scope in the dice loss, and added some checks for the preprocessed data.

You could try pulling the newest version and see if any of these changes happen to fix your problem. Otherwise, I suggest you try again after downgrading to Tensorflow 1.8.

Let me know if that fixes your issues.
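
The thread does not say what the added checks on the preprocessed data actually test. As a rough idea of what such a check might look like, here is a minimal sketch: `summarize_labels` is a hypothetical helper (not from the repository) that counts the label values in a flattened mask, assuming four classes with 0 as background. It flags values outside the expected range and masks with no foreground at all, both of which could plausibly keep a foreground dice at zero.

```python
from collections import Counter


def summarize_labels(mask, num_classes=4):
    """Count label values in a flat mask and flag obvious problems.

    Hypothetical helper for illustration; num_classes=4 is an assumption
    (background plus three foreground structures).
    """
    counts = Counter(mask)
    # Any value outside 0..num_classes-1 suggests a preprocessing problem,
    # e.g. masks stored as 0/255 or interpolated to non-integer values.
    unexpected = sorted(v for v in counts if v not in range(num_classes))
    foreground = sum(n for v, n in counts.items() if v != 0)
    return {
        "counts": dict(counts),
        "unexpected_values": unexpected,
        "has_foreground": foreground > 0,
    }


good = summarize_labels([0, 0, 1, 2, 3, 0])
bad_values = summarize_labels([0, 0, 255, 255])
empty = summarize_labels([0, 0, 0, 0])
print(good)
print(bad_values)
print(empty)
```

Running something like this over a handful of preprocessed masks before training is cheap, and it separates data problems from model or framework problems.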

Answer 4 · 2018-08-23T13:56:55.000Z

That worked! Thank you.

Any plans on releasing the pre-trained weights?

Thanks again.

Answer 5 · 2018-08-24T10:33:47.000Z

Okay great. Out of curiosity, did downgrading to tf1.8 or my changes fix the problem?

Yes, we are planning to release the weights soon.

Answer 6 · 2018-08-24T14:19:22.000Z

It was your changes that fixed the issue. Thanks again!

Answer 7 · 2020-04-04T16:17:44.000Z

I'm seeing the same problem as OP, but only for the unet experiment, where my dice is always zero. The FCN experiment trains fine (dice score 0.78 after 10k steps).

Windows 10
Conda virtual env, used to install working versions of cudatoolkit and cudnn

My steps were basically:
conda create --name tfgpu112 tensorflow-gpu=1.12
pip install -r requirements.txt
change config according to my system
run train.py

Any suggestions for a workaround?
I'd really appreciate any help.