Evaluation Metrics
infusion-zero-edit opened this issue · 17 comments
There are no clear instructions in the repo on how to calculate and verify the metrics published in the paper. The metrics are also not calculated during the training or validation steps; only the input and denoised images are saved as results in the experiments folder.
The PSNR and SSIM metrics defined in this file are not referenced anywhere: https://github.com/StanfordMIMI/DDM2/blob/4f5a551a7f16e18883e3bf7451df7b46e691236d/core/metrics.py
Can you please add instructions for calculating the metrics after the third stage of training is completed?
Hi
@tiangexiang can you please clarify how to obtain the results reported in the paper on the test data once stage III training is over?
Hi, thanks for your interest! The metrics were calculated based on this script provided by DIPY (https://dipy.org/documentation/1.1.0./examples_built/snr_in_cc/), since it provides the foreground regions needed to calculate SNR/CNR. Note that we only reported the metrics on the Stanford HARDI dataset, following the steps in the DIPY script.
We plan to release our evaluation script in a few days!
Thank you. Waiting eagerly for the evaluation script. Also, for the Sherbrooke dataset, is there a way to compute any quantitative metric on it as well? Can we use https://dipy.org/documentation/1.1.0./examples_built/snr_in_cc/ to evaluate the Sherbrooke test data as well?
I think the primary concern is the definition of the foreground ROI, which is used to calculate SNR/CNR. Without the medical expertise to know which region should be defined as the ROI, we didn't try to calculate metrics on other datasets. If your team is able to localize the ROIs for other datasets precisely, you can use the same script directly :)
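To make the role of the ROI concrete, here is a minimal sketch of a per-volume SNR/CNR calculation. The formulas mirror the snippet quoted later in this thread; the function name `snr_cnr_per_volume`, the mask arguments, and the choice to take the noise statistics from a background mask are assumptions for illustration, not the authors' evaluation script.

```python
import numpy as np

def snr_cnr_per_volume(data, roi_mask, bg_mask, eps=1e-7):
    """Hypothetical helper: SNR and CNR for each volume of a 4D array.

    data:     array of shape (X, Y, Z, N), one 3D volume per direction
    roi_mask: boolean array of shape (X, Y, Z), foreground ROI
    bg_mask:  boolean array of shape (X, Y, Z), background / noise region
    """
    roi = data[roi_mask]   # shape (n_roi_voxels, N)
    bg = data[bg_mask]     # shape (n_bg_voxels, N)

    mean_signal = roi.mean(axis=0)  # mean foreground intensity per volume
    mean_bg = bg.mean(axis=0)       # mean background intensity per volume
    noise_std = bg.std(axis=0)      # assumed noise estimate: background std

    snr = mean_signal / (noise_std + eps)
    cnr = (mean_signal - mean_bg) / (noise_std + eps)
    return snr, cnr
```

Both returned arrays have one entry per volume, so any per-volume selection applied afterwards has to use a mask of the same length.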
Hi @tiangexiang
Can you please update the metric code? Thanks.
Hi, our script for the quantitative metric calculation is uploaded. Please see the README for details :) Note that in the notebook we tested a different set of denoised data, which yields slightly different scores from the ones we reported in the paper.
Thank you for the prompt response, @tiangexiang.
Hi @tiangexiang, I have run your notebook and the saved denoised output has shape (81, 106, 76, 1). It gives an error at SNR = SNR[sel_b], saying that sel_b has size 160 and SNR has size 11. Can you please help?
SNR = mean_signal_denoised[k] / (denoised_noise_std[k]+1e-7)
CNR = (mean_signal_denoised[k] - denoised_mean_bg[k]) / (denoised_noise_std[k]+1e-7)
SNR = SNR[sel_b]
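A size mismatch like this is what NumPy boolean indexing reports when the mask and the indexed array disagree in length. A small guard along the following lines would surface the cause before `SNR[sel_b]` is reached. The helper name `select_by_mask` is hypothetical and not part of the notebook; this is only a sketch.

```python
import numpy as np

def select_by_mask(values, sel_b):
    """Hypothetical guard around `values[sel_b]` for per-volume scores."""
    values = np.asarray(values)
    sel_b = np.asarray(sel_b, dtype=bool)
    if values.shape[0] != sel_b.shape[0]:
        raise ValueError(
            f"per-volume scores have {values.shape[0]} entries but the "
            f"selection mask has {sel_b.shape[0]}; the denoised data "
            "probably does not contain every volume"
        )
    return values[sel_b]
```

With 11 scores and a 160-entry mask, this raises a `ValueError` that names both sizes instead of failing inside the indexing expression.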
Hi @tiangexiang, I have now run the denoising script for all the slices. I saw that your code takes only one slice, slice 32. After that change the evaluation metrics code runs fine, but following the steps in your GitHub repo we get the following results:
raw [SNR] mean: 5.1141 std: 2.4988
raw [CNR] mean: 4.6567 std: 2.4976
our [SNR delta] mean: 1.0223 std: 1.3709 best: 3.8002 worst: -1.7891
our [CNR delta] mean: 0.9643 std: 1.3711 best: 3.7478 worst: -1.8394
The results reported in the evaluation_metrics notebook are different:
our [SNR delta] mean: 1.8284 std: 1.6969 best: 4.9025 worst: -1.7451
our [CNR delta] mean: 1.7486 std: 1.6949 best: 4.8205 worst: -1.8113
The box plot produced in the notebook from these results does not match the box plot in the paper.
So we are not sure how to arrive at the results reported in the paper. Please help.
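For anyone comparing numbers, the summary lines quoted above could be produced along these lines. This is a sketch under assumptions: that "delta" means the denoised score minus the raw score for each volume, and that "best" and "worst" are the maximum and minimum delta. The function name `summarize_delta` is hypothetical, and the notebook may compute these differently.

```python
import numpy as np

def summarize_delta(raw_scores, denoised_scores):
    """Hypothetical summary of the per-volume improvement (denoised - raw)."""
    raw = np.asarray(raw_scores, dtype=float)
    den = np.asarray(denoised_scores, dtype=float)
    if raw.shape != den.shape:
        raise ValueError("raw and denoised scores must have the same shape")
    delta = den - raw
    return {
        "mean": float(delta.mean()),
        "std": float(delta.std()),
        "best": float(delta.max()),   # largest improvement
        "worst": float(delta.min()),  # smallest (possibly negative) change
    }
```

Printing these four values for both SNR and CNR gives a like-for-like comparison with the notebook output, provided both runs cover the same set of volumes.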
Hi @anantkha, sorry for the lack of clarity. The evaluation script needs to run on ALL the slices of ALL the volumes (except for the b=0 volumes). In 'denoise.py' we set the slice index to 32 just for a quick demo; you need to change 32 to 'all' to denoise all slices in order to calculate the metrics. This can take a relatively long time :)
After denoising ALL non-b0 volumes, you need to append the original b0 volumes into the denoised results. Don't worry about different intensity scales, there is a normalization step in the notebook.
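The reassembly step can be sketched as follows. This is a hypothetical helper, not code from the repo: the name `merge_b0_and_denoised`, the `b0_threshold` parameter, and the assumption that the denoised volumes are stored in the same order as the non-b0 volumes of the raw data are all illustrative.

```python
import numpy as np

def merge_b0_and_denoised(raw, denoised, bvals, b0_threshold=50):
    """Hypothetical helper: put raw b0 volumes back next to denoised volumes.

    raw:      array (X, Y, Z, N) with all original volumes
    denoised: array (X, Y, Z, N - n_b0) with denoised non-b0 volumes,
              assumed to be in the same order as in `raw`
    bvals:    length-N sequence of b-values
    """
    bvals = np.asarray(bvals)
    b0_mask = bvals <= b0_threshold
    n_dwi = int((~b0_mask).sum())

    if raw.shape[-1] != bvals.shape[0]:
        raise ValueError("raw data and bvals disagree on the volume count")
    if denoised.shape[:3] != raw.shape[:3] or denoised.shape[-1] != n_dwi:
        raise ValueError(
            f"expected denoised shape {raw.shape[:3] + (n_dwi,)}, "
            f"got {denoised.shape}"
        )

    merged = np.empty(raw.shape, dtype=np.float32)
    merged[..., b0_mask] = raw[..., b0_mask]   # original b0 volumes
    merged[..., ~b0_mask] = denoised           # denoised diffusion volumes
    return merged
```

Placing volumes by mask rather than concatenating keeps every volume at its original index, so a `bvals`-derived selection still lines up with the merged array.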
Lastly, as explained in an earlier thread, the denoised results we provided are different from the ones used in the paper, therefore the metric scores could be a bit different as well.
Yes @tiangexiang, I have run it on all the slices, which does take a long time, and calculated the results on that basis, but they do not match the results in the paper. Any thoughts? I followed the exact steps in your GitHub repo and there is still a huge difference: according to the box plot in the paper the mean should be around 5, but we are getting a mean of only 1.02.
Yes, after saving the denoised results the shape is (81, 106, 76, 150); the remaining 10 b0 volumes are concatenated and the results are calculated on that basis. Still, we are not getting the same results as in the paper. Is there any way to get exactly the same results as in the paper?
Frankly, I am also not sure why your quantitative scores are this low. It could be variations from generations, could be something wrong during training, could also be a problem with software/hardware versions, etc. Can you please double-check the visual quality? And make sure it is reasonable throughout the training of Stage III.
We ran the code in our software/hardware environment multiple times and got similar results every time. At least the 1.02 delta SNR is still better than all the other comparison methods.
GPU used: Nvidia Tesla V100 32 GB
I also ran this twice and observed that training is quite fast: stage I training completed in 15 minutes and stage III training completed in two to three hours, all without any error. If you want, I can share the logs with you. I just wanted you to check whether the training scripts contain any hard-coded constraint like the one we found in the denoising script, which processes only slice 32. A constraint like that during training might explain why it finishes so quickly, which ideally should not happen because here we are training a diffusion model, not an MLP network.
Inference on all the slices takes quite a long time, so I am wondering how the training can be so fast. Could you please check the training scripts for any such constraint? Thanks.
Yeah you are right, the training is abnormally fast. Let's both go through the script to see if there are suspicious constraints/errors. I will keep you posted whenever I make an update! Thank you!
I experienced a similar problem. Training was fairly quick for phases 1 and 2, and phase 3 took about 2 hours. The evaluation results were significantly lower than those in the paper. My GPU is an Nvidia RTX 2080Ti. Due to some compatibility issues I changed the versions of some packages, and I'm not sure whether that causes a difference in the results.