Evaluation of MS lesion segmentation algorithm
plbenveniste opened this issue · 19 comments
Opening this issue to discuss the results and performance of the models trained to segment MS lesions in the spinal cord.
Discussing the results of the model output stored in ms_lesion_agnostic/results/2024-04-21_16/06/04.890513
Here is the model performance on the training set (data from canproco and sct-testing-large)
Here are the detailed observations:
- Lesion GT segmentations that should be corrected
- sub-edm181_ses-M0_PSIR
- sub-cal156_ses-M0_STIR
- sub-cal104_ses-M0_STIR: lesions are too small to be kept
- sub-mon004_ses-M0_PSIR
- sub-mon010_ses-M12_PSIR (lesions are too small): either remove them on the lesion segmentation or remove during pre-processing
- sub-mon137_ses-M0_PSIR (to be discussed)
- sub-karo1898_acq-sagcerv_T2star (to be discussed)
- sub-karo2032_acq-sagcerv_T2star
- sub-karo2039_acq-sagcerv_T2star
- sub-nyuShepherd022_acq-sup_T2w
- Identified problems:
- The model tends to segment lesions which are too small. Problem: some input data contains lesion segmentations which are very small. Solution: either remove those lesions from the lesion segmentation directly, remove them during pre-processing, or remove them after model prediction (see the sketch after this list)?
- The model doesn’t segment lesions close to the brain. (ex: sub-amuVirginie009_T2w, sub-bwh026_acq-sagstir_T2w, sub-lyonOfsep004_acq-sag_T2w, sub-lyonOfsep082_acq-sag_T2w, sub-milanFilippi064_acq-sag_T2w)
- In larger fields of view than canproco, the model usually doesn't segment lesions close to the brain, but still segments lesions in the cervical/thoracic spinal cord (sub-lyonOfsep001_acq-sag_T2w)
- The subject sub-lyonOfsep079_acq-sag_T2w should be excluded: there was a problem when reconstructing the image
- Rare issue of the model segmenting outside the spinal cord: sub-rennesMS018_acq-sagthor_T2w, sub-rennesMS027_acq-sagthor_T2w, sub-rennesMS049_acq-sagthor_T2w. Maybe because the spinal cord is not centered: the model tends to segment in the middle (on the vertical axis)
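As a possible fix for the small-lesion problem listed above, here is a minimal pre-processing sketch (not an existing script in this repo) that removes connected components below a volume threshold; the 5 mm³ threshold is a placeholder for the value still to be defined:

```python
# Hedged sketch (not an existing script in this repo): remove connected
# components below a volume threshold from a lesion mask. The 5 mm^3
# threshold is a placeholder for the value still to be defined.
import nibabel as nib
import numpy as np
from scipy import ndimage

def remove_small_lesions(mask_path: str, out_path: str, min_volume_mm3: float = 5.0) -> None:
    nii = nib.load(mask_path)
    mask = np.asanyarray(nii.dataobj) > 0
    voxel_volume = float(np.prod(nii.header.get_zooms()[:3]))  # mm^3 per voxel

    # Label connected components (default 6-connectivity in 3D)
    labeled, n_lesions = ndimage.label(mask)
    cleaned = np.zeros(mask.shape, dtype=np.uint8)
    for lesion_id in range(1, n_lesions + 1):
        lesion = labeled == lesion_id
        if lesion.sum() * voxel_volume >= min_volume_mm3:
            cleaned[lesion] = 1

    nib.save(nib.Nifti1Image(cleaned, nii.affine), out_path)
```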
Conclusion:
- Because of the use of RandCropByPosNegLabeld, the model tends to segment in the middle of the spinal cord (on the superior-inferior axis): this can be solved by modifying/removing this DA strategy (see the sketch below).
- Lesions which are below a certain volume (to be defined) should not be kept during model training and during inference.
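For reference, a minimal sketch (MONAI transforms; patch size, pos/neg ratios and the extra crop size are assumptions) of how the cropping strategy could be relaxed so that patches are not always centred on lesion voxels along the S-I axis:

```python
# Minimal sketch: relax RandCropByPosNegLabeld and add an extra random crop
# for positional variability; all sizes and ratios below are assumptions.
from monai.transforms import Compose, RandCropByPosNegLabeld, RandSpatialCropd

train_crop = Compose([
    RandCropByPosNegLabeld(
        keys=["image", "label"],
        label_key="label",
        spatial_size=[64, 128, 128],  # assumed to match the inference kernel
        pos=1.0,
        neg=1.0,           # neg > 0 also samples crops centred on background voxels
        num_samples=4,
    ),
    # Optional extra crop to add positional variability within each patch
    RandSpatialCropd(keys=["image", "label"], roi_size=[64, 96, 96], random_size=False),
])
```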
Here is a graph presenting the results of the investigations done:
This shows a comparison between:
- the current state of the art model on which the above observations were done (called BestAttention+spacing1;0)
- the SOTA model without the RandCropByPosNeg augmentation function
- the SOTA model without the RandCropByPosNeg augmentation function and without skipping empty patches
- the SOTA model with a SpatialCrop function on top of RandCropByPosNeg to add variability
- the SOTA model with the removeSmall augmentation function
As we can see on the graphs, the explorations did not outperform the current SOTA model.
As suggested by @jcohenadad, exploring the nnUNet data augmentation strategies could help: link to data augmentation
Very interesting investigations @plbenveniste!
The current investigation is to train a MedNext model for MS lesion segmentation:
- The first investigation was to train a MedNext model with a low number of channels: 3 (to see if it works). It showed a weird behavior, as we can see on the following graph (comparison between the SOTA model, in violet, and the low-channel MedNext model).
- The current MedNext model uses the highest number of channels possible (n_channels=16) given the GPU limitations (using 42GB out of 47GB). It is currently training, but it seems to be slightly under-performing the SOTA model.
- The next investigation I want to do is to increase the number of channels in MedNext and, to make room for it, reduce the number of samples produced by RandCropByPosNeg (num_samples=4).
Here is the output of the investigation of increasing the size of the MedNext model:
I am currently running some experiments on all the data after it was reformatted. The input data is the following (stored on moneta): /home/plbenveniste/net/ms-lesion-agnostic/msd_data/dataset_2024-06-26_seed42_lesionOnly.json
One thing I had to do was set num_workers=0 when loading the validation dataset; otherwise, training would crash on the entire dataset during the first validation step.
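For context, a hedged sketch of the validation loading setup (assuming the JSON is an MSD-style datalist with a "validation" key; the transforms shown are placeholders, the real pipeline also resamples/normalizes):

```python
# Hedged sketch: load the validation split and build a loader with
# num_workers=0, which avoided the crash at the first validation step.
from monai.data import CacheDataset, DataLoader, load_decathlon_datalist
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged

datalist_path = "/home/plbenveniste/net/ms-lesion-agnostic/msd_data/dataset_2024-06-26_seed42_lesionOnly.json"
# data_list_key="validation" is an assumption about the JSON layout
val_files = load_decathlon_datalist(datalist_path, is_segmentation=True, data_list_key="validation")

# Placeholder transforms
val_transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
])

val_ds = CacheDataset(data=val_files, transform=val_transforms, cache_rate=0.25)
val_loader = DataLoader(val_ds, batch_size=1, shuffle=False, num_workers=0)
```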
Here are my findings:
- decreasing the resampling resolution to 0.5 mm isotropic (instead of 0.7) decreases the model's performance
- increasing the AttUnet dropout ratio to 0.2 (instead of 0.1) decreases performance
- it seems that increasing model depth from [32, 64, 128, 256, 512] to [32, 64, 128, 256, 512, 1024] didn't affect the model performance.
Other findings:
- using DiceFocalLoss made training fail
- using GeneralizedDiceFocalLoss made training fail
- using a model depth ranging from 64 to 1024 improved the model's performance slightly (~2%); however, it requires much more memory, which made it impossible to train with RandCropByPosNeg with num_samples greater than 2.
For now, my conclusions are:
- the best resampling resolution is 0.7 isotropic
- the nnUNet data augmentation didn't improve the model performance
- the best model architecture is AttUnet with depth [64, 128, 256, 512, 1024]
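As a reference, a minimal sketch of that architecture using MONAI's AttentionUnet (the strides and dropout value are assumptions):

```python
# Sketch of the best-performing architecture described above;
# strides and dropout are assumptions.
from monai.networks.nets import AttentionUnet

model = AttentionUnet(
    spatial_dims=3,
    in_channels=1,
    out_channels=1,
    channels=(64, 128, 256, 512, 1024),
    strides=(2, 2, 2, 2),
    dropout=0.1,
)
```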
TODO:
- Use the best model to perform inference on the training and validation images.
- Identify the images on which the model is poorly performing
- Manually correct these images and/or remove them from the training/validation sets (using an exclude.yml file for instance; see the sketch after this list)
- Retrain the model
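A hedged sketch of the exclusion mechanism mentioned above (the exclude.yml layout and the datalist keys are assumptions):

```python
# Hedged sketch: filter problematic images out of the training split before
# retraining, based on an exclude.yml list.
import yaml
from monai.data import load_decathlon_datalist

# Assumed: exclude.yml is a flat YAML list of image file names
with open("exclude.yml") as f:
    excluded = set(yaml.safe_load(f))

datalist_path = "/home/plbenveniste/net/ms-lesion-agnostic/msd_data/dataset_2024-07-22_seed42_lesionOnly.json"
train_files = load_decathlon_datalist(datalist_path, is_segmentation=True, data_list_key="training")

# Keep only entries whose image path does not match an excluded file name
train_files = [
    entry for entry in train_files
    if not any(name in str(entry["image"]) for name in excluded)
]
```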
The best model is stored in /home/plbenveniste/net/ms-lesion-agnostic/results/2024-07-18_10:46:21.634514/
It is called BestAtt+allData_06-26 in wandb.
The inference was performed on the training, validation and testing sets using the following commands on kronos:
conda activate venv_monai
CUDA_VISIBLE_DEVICES=1 python ms-lesion-agnostic/monai/test_model.py --config ms-lesion-agnostic/monai/config_test.yml --data-split train
CUDA_VISIBLE_DEVICES=2 python ms-lesion-agnostic/monai/test_model.py --config ms-lesion-agnostic/monai/config_test.yml --data-split test
CUDA_VISIBLE_DEVICES=3 python ms-lesion-agnostic/monai/test_model.py --config ms-lesion-agnostic/monai/config_test.yml --data-split validation
The results of these inferences are stored in the same folder as the best model.
I ran the following command line to get the performance of the model (for test set):
python ms-lesion-agnostic/monai/plot_performance.py --pred-dir-path ~/net/ms-lesion-agnostic/results/2024-07-18_10\:46\:21.634514/test_set/ --data-json-path ~/net/ms-lesion-agnostic/msd_data/dataset_2024-07-22_seed42_lesionOnly.json --split test
- On the test set:
- On the validation set:
- On the train set:
From a visual analysis of the validation set, no clear pattern emerged:
- bavaria-quebec-spine-ms-unstitched/sub-m058483/ses-20210310/anat/sub-m058483_ses-20210310_acq-ax_chunk-1_T2w.nii.gz: one lesion missed
- basel-mp2rage/sub-P103/anat/sub-P103_UNIT1.nii.gz: nothing particular
- ms-lesion-agnostic/data/bavaria-quebec-spine-ms-unstitched/sub-m333631/ses-20210525/anat/sub-m333631_ses-20210525_acq-ax_chunk-3_T2w.nii.gz: one lesion missed
- canproco/sub-cal104/ses-M0/anat/sub-cal104_ses-M0_STIR.nii.gz: small lesions missed
- canproco/sub-mon171/ses-M0/anat/sub-mon171_ses-M0_PSIR.nii.gz: medium lesion missed in the middle and big lesion missed at the bottom
- nih-ms-mp2rage/sub-nih073/anat/sub-nih073_UNIT1.nii.gz: missed pretty obvious lesions and segmented one in the middle of the brain
- sct-testing-large/sub-bwh001/anat/sub-bwh001_acq-sagstir_T2w.nii.gz: segmented very small false-positive lesions, missed a big lesion near the brain stem
- sct-testing-large/sub-karo2011/anat/sub-karo2011_acq-sagcerv_T2star.nii.gz: big FOV (head to thoracic); segmented a mini lesion near the breast
- sct-testing-large/sub-lyonOfsep022/anat/sub-lyonOfsep022_acq-sag_T2w.nii.gz: missed a lesion at the top and segmented lesions in the middle (not sure they are supposed to be there)
- sct-testing-large/sub-vanderbiltSeth013/anat/sub-vanderbiltSeth013_T2star.nii.gz: segmented nothing even though there are a few obvious lesions
Moving on to a more precise analysis of the results.
When looking at the results on the training set, I noticed that the model performs rather poorly on the basel dataset. I was wondering if maybe it was linked to the fields of view of the images.
On the train set:
- sub-P011_UNIT1.nii.gz: image of the brain with lesion in upper spinal cord: pred misses pretty obvious lesions
- sub-P013_UNIT1.nii.gz: image of the brain with lesions in upper spinal cord: pred misses obvious lesions and segments a tiny one in the brain
- sub-P016_UNIT1.nii.gz: image of the brain with lesion in upper spinal cord: pred misses pretty obvious lesions
- sub-nih021_UNIT1.nii.gz: image of the brain with lesion in upper spinal cord: pred segments stuff in the brain
- sub-nih060_UNIT1.nii.gz: image of the brain with lesion in upper spinal cord: pred segments stuff in the brain
- sub-bwh026_acq-sagstir_T2w.nii.gz: image of cervical spinal cord: model misses pretty obvious lesion close to the brain
- sub-karo1898_acq-sagcerv_T2star.nii.gz: cervico-thoracic image of the spinal cord: model misses pretty obvious lesions near C2-C3
- sub-lyonOfsep004_acq-sag_T2w.nii.gz: cervico-thoracic image of the spinal cord: model misses pretty obvious lesions near C1-C2
From this, I have two conclusions:
- it seems that the model doesn't perform well on images which contain a large portion of the brain
- it seems that the model often misses lesions around level C1 to C3 or in the brain stem
Next steps to be discussed.
To deal with the problem caused by protocol differences when segmenting lesions (some sites segment lesions in the brain stem while others stop at C1), we decided to remove the regions of the images which are above C1. To do so, we will use the contrast-agnostic model to segment the spinal cord and remove all voxels above the highest segmented point in the inferior-superior direction.
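A minimal sketch of that cropping step (assumptions: nibabel/numpy, the last axis running inferior→superior, and placeholder file names; the spinal cord segmentation comes from the contrast-agnostic model):

```python
# Hedged sketch: zero out everything above the most superior voxel of the
# spinal cord segmentation. Orientation and file names are assumptions.
import nibabel as nib
import numpy as np

img = nib.load("sub-XXX_T2w.nii.gz")
sc_seg = nib.load("sub-XXX_T2w_sc-seg.nii.gz")

data = np.asanyarray(img.dataobj).copy()
seg = np.asanyarray(sc_seg.dataobj)

# Most superior slice containing spinal cord (last axis assumed I->S)
top_slice = int(np.max(np.nonzero(seg.sum(axis=(0, 1)))[0]))
data[:, :, top_slice + 1:] = 0  # remove everything above the spinal cord

nib.save(nib.Nifti1Image(data, img.affine, img.header), "sub-XXX_T2w_cropped.nii.gz")
```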
To reduce the variability of the AttentionUnet model, I tried doing some test-time-augmentation.
The experiment was done using the script monai/test_model_tta.py.
conda activate venv_monai
CUDA_VISIBLE_DEVICES=1 python ms-lesion-agnostic/monai/test_model_tta.py --config ms-lesion-agnostic/monai/config_test.yml --data-split test
We chose to perform 10 predictions for each subject, which were stored in separate files.
The dice scores were averaged using the following code:
python ms-lesion-agnostic/monai/average_tta_performance.py --pred-dir-path ~/moneta/users/pierrelouis/ms-lesion-agnostic/tta_exp/test_set
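For reference, a hedged sketch of what this averaging step does conceptually (not the exact average_tta_performance.py code; the CSV layout is an assumption):

```python
# Hedged sketch: average the per-run Dice scores of the 10 TTA predictions
# into one value per subject. Column names are assumptions.
import pandas as pd

scores = pd.read_csv("tta_dice_scores.csv")  # assumed columns: subject, tta_run, dice
mean_dice = scores.groupby("subject")["dice"].mean().reset_index()
mean_dice.to_csv("tta_dice_scores_averaged.csv", index=False)
```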
Then we plot the performance of the average dice:
python ms-lesion-agnostic/monai/plot_performance.py --pred-dir-path ~/net/ms-lesion-agnostic/tta_exp/test_set/ --data-json-path ~/net/ms-lesion-agnostic/msd_data/dataset_2024-07-24_seed42_lesionOnly.json --split test
Here is the output of this work with TTA:
Here is a GIF showing the comparison between the performance without TTA and the performance with TTA.
- I am now looking into examples of the algorithm having very different predictions for the same subject
Warning
I just realized I was making a big mistake in the test inference script: I was cropping the image to the size of the kernel ([64, 128, 128] in RPI) before inference. Because of this, the model doesn't see the entire image. I am currently working on fixing this, which means I will also have to recompute the models' performances.
I removed the cropping of the image before doing inference. It is now giving more accurate results. This can be seen in the fact that fewer predictions have a Dice score of 1: that happened when the images were cropped and the cropped image didn't contain any lesion.
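For reference, one common way to let a fixed-kernel model see the whole image is MONAI's sliding-window inference; this is a hedged sketch, not necessarily the exact fix applied in test_model.py (the model and image below are placeholders):

```python
# Hedged sketch: sliding-window inference over the full image with the
# training kernel size; model and image are placeholders.
import torch
from monai.inferers import sliding_window_inference
from monai.networks.nets import AttentionUnet

model = AttentionUnet(spatial_dims=3, in_channels=1, out_channels=1,
                      channels=(64, 128, 256, 512, 1024), strides=(2, 2, 2, 2))
model.eval()
image = torch.zeros(1, 1, 96, 256, 192)  # (batch, channel, D, H, W), placeholder

with torch.no_grad():
    soft_pred = sliding_window_inference(
        inputs=image,
        roi_size=(64, 128, 128),  # kernel size used at training time
        sw_batch_size=4,
        predictor=model,
        overlap=0.5,
        mode="gaussian",
    )
```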
However, removing the cropping also raised the question of which threshold to use on the soft prediction: this is being explored in issue #32
The result is disappointing. I think the strategy I used is problematic: I was creating multiple predictions, thresholding each at 0.5, computing each Dice score, and then averaging the Dice scores.
TODO:
- A strategy which would make more sense would be to create multiple predictions, threshold each at 0.5, sum the predictions, threshold the sum at 0.5, and then calculate the Dice score.
Here is the output without TTA:
The variability is much higher, which shows that TTA does help reduce variability. But the average is higher as well. That confirms my hypothesis: with TTA, the results are lower because the average drops when one out of the 10 inferences fails completely (Dice score close to 0%).
TTA second strategy
In a second attempt to improve the results of TTA, I did the following:
- perform inference 10 times with different data augmentations
- threshold each prediction at 0.5
- sum the predictions
- threshold the summed prediction at 0.5
- compute the Dice score
python ms-lesion-agnostic/monai/compute_performance_tta_sum.py --path-pred ~/net/ms-lesion-agnostic/tta_exp/test_set/ --path-json ~/net/ms-lesion-agnostic/msd_data/dataset_2024-07-24_seed42_lesionOnly.json --split test --output-dir ~/net/ms-lesion-agnostic/tta_exp/perf_output_sum_image
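For reference, a hedged sketch of the aggregation described in the steps above (not the exact compute_performance_tta_sum.py code):

```python
# Hedged sketch: binarize each TTA prediction, sum them, binarize the sum,
# then compute the Dice score against the ground truth.
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice score between two binary masks."""
    intersection = np.sum(pred * gt)
    return float(2 * intersection / (pred.sum() + gt.sum() + eps))

def tta_sum_dice(soft_preds: list, gt: np.ndarray) -> float:
    binarized = [(p > 0.5).astype(np.uint8) for p in soft_preds]  # threshold each prediction
    summed = np.sum(binarized, axis=0)                            # voxel-wise vote count
    fused = (summed > 0.5).astype(np.uint8)                       # threshold the summed prediction
    return dice(fused, gt)
```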
Then to compute the figure, I did the following:
python ms-lesion-agnostic/monai/plot_performance.py --pred-dir-path ~/net/ms-lesion-agnostic/tta_exp/perf_output_sum_image/ --data-json-path ~/net/ms-lesion-agnostic/msd_data/dataset_2024-07-24_seed42_lesionOnly.json --split test
Here is the output:
This second strategy is not as effective as I thought it would be! The results are a bit disappointing as well.
Now I want to evaluate the normal (meaning no TTA) AttentionUnet model on the following metrics: Dice, PPV, Sensitivity and F1 score.
To do so, I modified the code in monai/test_model.py to include the computation of the other metrics.
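A hedged sketch (not the exact test_model.py code) of how these metrics can be computed from voxel-wise TP/FP/FN counts on binary masks:

```python
# Hedged sketch: Dice, PPV, sensitivity and F1 from voxel-wise counts.
import numpy as np

def lesion_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> dict:
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    dice = 2 * tp / (2 * tp + fp + fn + eps)
    ppv = tp / (tp + fp + eps)          # precision
    sensitivity = tp / (tp + fn + eps)  # recall
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity + eps)
    return {"dice": dice, "ppv": ppv, "sensitivity": sensitivity, "f1": f1}
```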
I ran it with the following code:
CUDA_VISIBLE_DEVICES=1 python ms-lesion-agnostic/monai/test_model.py --config ms-lesion-agnostic/monai/config_test.yml --data-split test
To plot the performance:
python ms-lesion-agnostic/monai/plot_performance.py --pred-dir-path ~/net/ms-lesion-agnostic/results/2024-07-18_10\:46\:21.634514/test_set/ --data-json-path ~/net/ms-lesion-agnostic/msd_data/dataset_2024-07-24_seed42_lesionOnly.json --split test
Here are the results: