Evaluation of MS lesion segmentation algorithm
plbenveniste opened this issue · 19 comments
Opening this issue to discuss the results and performance of the models trained to segment MS lesions in the spinal cord.
Discussing the results of the model output stored in ms_lesion_agnostic/results/2024-04-21_16/06/04.890513
Here is the model performance on the training set (data from canproco and sct-testing-large)
Here are the detailed observations:
- Lesion GT segmentations that should be corrected
- sub-edm181_ses-M0_PSIR
- sub-cal156_ses-M0_STIR
- sub-cal104_ses-M0_STIR: lesions are too small to be kept
- sub-mon004_ses-M0_PSIR
- sub-mon010_ses-M12_PSIR (lesions are too small): either remove them on the lesion segmentation or remove during pre-processing
- sub-mon137_ses-M0_PSIR (to be discussed)
- sub-karo1898_acq-sagcerv_T2star (to be discussed)
- sub-karo2032_acq-sagcerv_T2star
- sub-karo2039_acq-sagcerv_T2star
- sub-nyuShepherd022_acq-sup_T2w
- Identified problems:
- The model tends to segment lesions which are too small. Problem: some input data contains lesion segmentations which are very small. Solution: either remove those lesions from the lesion segmentation directly, remove them during pre-processing, or remove them after model prediction (see the sketch after this list)?
- The model doesn’t segment lesions close to the brain. (ex: sub-amuVirginie009_T2w, sub-bwh026_acq-sagstir_T2w, sub-lyonOfsep004_acq-sag_T2w, sub-lyonOfsep082_acq-sag_T2w, sub-milanFilippi064_acq-sag_T2w)
- In larger fields of view than canproco, the model usually doesn't segment lesions close to the brain, but still segments lesions in the cervical/thoracic spinal cord (sub-lyonOfsep001_acq-sag_T2w)
- The subject sub-lyonOfsep079_acq-sag_T2w should be excluded: there was a problem when reconstructing the image
- Rare issue of the model segmenting outside the spinal cord: sub-rennesMS018_acq-sagthor_T2w, sub-rennesMS027_acq-sagthor_T2w, sub-rennesMS049_acq-sagthor_T2w. Maybe because the spinal cord is not centered: the model tends to segment in the middle (on the vertical axis)
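As a possible fix for the small-lesion problem listed above, here is a minimal pre-processing sketch (not an existing script in this repo) that removes connected components below a volume threshold; the 5 mm³ threshold is a placeholder for the value still to be defined:

```python
# Hedged sketch (not an existing script in this repo): remove connected
# components below a volume threshold from a lesion mask. The 5 mm^3
# threshold is a placeholder for the value still to be defined.
import nibabel as nib
import numpy as np
from scipy import ndimage

def remove_small_lesions(mask_path: str, out_path: str, min_volume_mm3: float = 5.0) -> None:
    nii = nib.load(mask_path)
    mask = np.asanyarray(nii.dataobj) > 0
    voxel_volume = float(np.prod(nii.header.get_zooms()[:3]))  # mm^3 per voxel

    # Label connected components (default 6-connectivity in 3D)
    labeled, n_lesions = ndimage.label(mask)
    cleaned = np.zeros(mask.shape, dtype=np.uint8)
    for lesion_id in range(1, n_lesions + 1):
        lesion = labeled == lesion_id
        if lesion.sum() * voxel_volume >= min_volume_mm3:
            cleaned[lesion] = 1

    nib.save(nib.Nifti1Image(cleaned, nii.affine), out_path)
```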
Conclusion:
- Because of the use of RandCropByPosNegLabeld, the model tends to segment in the middle of the spinal cord (on the superior-inferior axis): this can be solved by modifying/removing this DA strategy (see the sketch below).
- Lesions which are below a certain volume (to be defined) should not be kept during model training and during inference.
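For reference, a minimal sketch (MONAI transforms; patch size, pos/neg ratios and the extra crop size are assumptions) of how the cropping strategy could be relaxed so that patches are not always centred on lesion voxels along the S-I axis:

```python
# Minimal sketch: relax RandCropByPosNegLabeld and add an extra random crop
# for positional variability; all sizes and ratios below are assumptions.
from monai.transforms import Compose, RandCropByPosNegLabeld, RandSpatialCropd

train_crop = Compose([
    RandCropByPosNegLabeld(
        keys=["image", "label"],
        label_key="label",
        spatial_size=[64, 128, 128],  # assumed to match the inference kernel
        pos=1.0,
        neg=1.0,           # neg > 0 also samples crops centred on background voxels
        num_samples=4,
    ),
    # Optional extra crop to add positional variability within each patch
    RandSpatialCropd(keys=["image", "label"], roi_size=[64, 96, 96], random_size=False),
])
```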
Here is a graph presenting the results of the investigations done:
This shows a comparison between:
- the current state of the art model on which the above observations were done (called BestAttention+spacing1;0)
- the SOTA model without the RandCropByPosNeg augmentation function
- the SOTA model without the RandCropByPosNeg augmentation function and without skipping empty patches
- the SOTA model with a SpatialCrop function on top of RandCropByPosNeg to add variability
- the SOTA model with the removeSmall augmentation function
As we can see on the graphs, the explorations did not outperform the current SOTA model.
As suggested by @jcohenadad, exploring the nnUNet data augmentation strategies could help: link to data augmentation
Very interesting investigations @plbenveniste!
The current investigation is to train a MedNext model for MS lesion segmentation:
- The first investigation was to train a MedNext model with a low number of channels: 3 (to see if it works). It showed a weird behavior, as we can see on the following graph (comparison between the SOTA model, in violet, and the low-channel MedNext model).
- The current MedNext model uses the highest number of channels possible (n_channels=16) given the GPU limitations (using 42GB out of 47GB). It is currently training, but it seems to be slightly under-performing the SOTA model.
- The next investigation I want to do is to increase the number of channels in MedNext and, to make room for it, reduce the number of samples produced by RandCropByPosNeg (num_samples=4).
Here is the output of the investigation of increasing the size of the MedNext model:
I am currently running some experiments on all the data after it was reformatted. The input data is the following (stored on moneta): /home/plbenveniste/net/ms-lesion-agnostic/msd_data/dataset_2024-06-26_seed42_lesionOnly.json
One thing I had to do was set num_workers=0 when loading the validation dataset; otherwise, training would crash on the entire dataset during the first validation step.
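For context, a hedged sketch of the validation loading setup (assuming the JSON is an MSD-style datalist with a "validation" key; the transforms shown are placeholders, the real pipeline also resamples/normalizes):

```python
# Hedged sketch: load the validation split and build a loader with
# num_workers=0, which avoided the crash at the first validation step.
from monai.data import CacheDataset, DataLoader, load_decathlon_datalist
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged

datalist_path = "/home/plbenveniste/net/ms-lesion-agnostic/msd_data/dataset_2024-06-26_seed42_lesionOnly.json"
# data_list_key="validation" is an assumption about the JSON layout
val_files = load_decathlon_datalist(datalist_path, is_segmentation=True, data_list_key="validation")

# Placeholder transforms
val_transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
])

val_ds = CacheDataset(data=val_files, transform=val_transforms, cache_rate=0.25)
val_loader = DataLoader(val_ds, batch_size=1, shuffle=False, num_workers=0)
```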
Here are my findings:
- decreasing the resampling resolution to 0.5 mm isotropic (instead of 0.7) decreases the model's performance
- increasing the AttUnet dropout ratio to 0.2 (instead of 0.1) decreases performance
- it seems that increasing model depth from [32, 64, 128, 256, 512] to [32, 64, 128, 256, 512, 1024] didn't affect the model performance.
Other findings:
- using DiceFocalLoss made training fail
- using GeneralizedDiceFocalLoss made training fail
- using a model depth ranging from 64 to 1024 improved the model's performance slightly (~2%); however, it requires much more memory, which made it impossible to train with RandCropByPosNeg with num_samples greater than 2.
For now, my conclusions are:
- the best resampling resolution is 0.7 isotropic
- the nnUNet data augmentation didn't improve the model performance
- the best model architecture is AttUnet with depth [64, 128, 256, 512, 1024]
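As a reference, a minimal sketch of that architecture using MONAI's AttentionUnet (the strides and dropout value are assumptions):

```python
# Sketch of the best-performing architecture described above;
# strides and dropout are assumptions.
from monai.networks.nets import AttentionUnet

model = AttentionUnet(
    spatial_dims=3,
    in_channels=1,
    out_channels=1,
    channels=(64, 128, 256, 512, 1024),
    strides=(2, 2, 2, 2),
    dropout=0.1,
)
```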
TODO:
- Use the best model to perform inference on the training and validation images.
- Identify the images on which the model is poorly performing
- Manually correct these images and/or remove them from the training/validation sets (using an exclude.yml file for instance; see the sketch after this list)
- Retrain the model
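A hedged sketch of the exclusion mechanism mentioned above (the exclude.yml layout and the datalist keys are assumptions):

```python
# Hedged sketch: filter problematic images out of the training split before
# retraining, based on an exclude.yml list.
import yaml
from monai.data import load_decathlon_datalist

# Assumed: exclude.yml is a flat YAML list of image file names
with open("exclude.yml") as f:
    excluded = set(yaml.safe_load(f))

datalist_path = "/home/plbenveniste/net/ms-lesion-agnostic/msd_data/dataset_2024-07-22_seed42_lesionOnly.json"
train_files = load_decathlon_datalist(datalist_path, is_segmentation=True, data_list_key="training")

# Keep only entries whose image path does not match an excluded file name
train_files = [
    entry for entry in train_files
    if not any(name in str(entry["image"]) for name in excluded)
]
```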
The best model is stored in /home/plbenveniste/net/ms-lesion-agnostic/results/2024-07-18_10:46:21.634514/
It is called BestAtt+allData_06-26 in wandb.
The inference was performed on the training, validation and testing sets using the following commands on kronos:
conda activate venv_monai
CUDA_VISIBLE_DEVICES=1 python ms-lesion-agnostic/monai/test_model.py --config ms-lesion-agnostic/monai/config_test.yml --data-split train
CUDA_VISIBLE_DEVICES=2 python ms-lesion-agnostic/monai/test_model.py --config ms-lesion-agnostic/monai/config_test.yml --data-split test
CUDA_VISIBLE_DEVICES=3 python ms-lesion-agnostic/monai/test_model.py --config ms-lesion-agnostic/monai/config_test.yml --data-split validation
The results of these inferences are stored in the same folder as the best model.
I ran the following command line to get the performance of the model (for test set):
python ms-lesion-agnostic/monai/plot_performance.py --pred-dir-path ~/net/ms-lesion-agnostic/results/2024-07-18_10\:46\:21.634514/test_set/ --data-json-path ~/net/ms-lesion-agnostic/msd_data/dataset_2024-07-22_seed42_lesionOnly.json --split test
- On the test set:
- On the validation set:
- On the train set:
From a visual analysis of the validation set, no clear pattern emerged:
- bavaria-quebec-spine-ms-unstitched/sub-m058483/ses-20210310/anat/sub-m058483_ses-20210310_acq-ax_chunk-1_T2w.nii.gz: one lesion missed
- basel-mp2rage/sub-P103/anat/sub-P103_UNIT1.nii.gz: nothing particular
- ms-lesion-agnostic/data/bavaria-quebec-spine-ms-unstitched/sub-m333631/ses-20210525/anat/sub-m333631_ses-20210525_acq-ax_chunk-3_T2w.nii.gz: one lesion missed
- canproco/sub-cal104/ses-M0/anat/sub-cal104_ses-M0_STIR.nii.gz: small lesions missed
- canproco/sub-mon171/ses-M0/anat/sub-mon171_ses-M0_PSIR.nii.gz: medium lesion missed in the middle and big lesion missed at the bottom
- nih-ms-mp2rage/sub-nih073/anat/sub-nih073_UNIT1.nii.gz: missed pretty obvious lesions and segmented one in the middle of the brain
- sct-testing-large/sub-bwh001/anat/sub-bwh001_acq-sagstir_T2w.nii.gz: segmented very small false-positive lesions, missed a big lesion near the brain stem
- sct-testing-large/sub-karo2011/anat/sub-karo2011_acq-sagcerv_T2star.nii.gz: big FOV (head to thoracic); segmented a mini lesion near the breast
- sct-testing-large/sub-lyonOfsep022/anat/sub-lyonOfsep022_acq-sag_T2w.nii.gz: missed a lesion at the top and segmented lesions in the middle (not sure they are supposed to be there)
- sct-testing-large/sub-vanderbiltSeth013/anat/sub-vanderbiltSeth013_T2star.nii.gz: segmented nothing even though there are a few obvious lesions
Moving on to a more precise analysis of the results.
When looking at the results on the training set, I noticed that the model performs rather poorly on the basel dataset. I was wondering if maybe it was linked to the fields of view of the images.
On the train set:
- sub-P011_UNIT1.nii.gz: image of the brain with lesion in upper spinal cord: pred misses pretty obvious lesions
- sub-P013_UNIT1.nii.gz: image of the brain with lesions in upper spinal cord: pred misses obvious lesions and segments a tiny one in the brain
- sub-P016_UNIT1.nii.gz: image of the brain with lesion in upper spinal cord: pred misses pretty obvious lesions
- sub-nih021_UNIT1.nii.gz: image of the brain with lesion in upper spinal cord: pred segments stuff in the brain
- sub-nih060_UNIT1.nii.gz: image of the brain with lesion in upper spinal cord: pred segments stuff in the brain
- sub-bwh026_acq-sagstir_T2w.nii.gz: image of cervical spinal cord: model misses pretty obvious lesion close to the brain
- sub-karo1898_acq-sagcerv_T2star.nii.gz: cervico-thoracic image of the spinal cord: model misses pretty obvious lesions near C2-C3
- sub-lyonOfsep004_acq-sag_T2w.nii.gz: cervico-thoracic image of the spinal cord: model misses pretty obvious lesions near C1-C2
From this, I have two conclusions:
- it seems that the model doesn't perform well on images which contain a large portion of the brain
- it seems that the model often misses lesions around level C1 to C3 or in the brain stem
Next steps to be discussed.
To deal with the problem caused by protocol differences when segmenting lesions (some sites segment lesions in the brain stem while others stop at C1), we decided to remove the regions of the images which are above C1. To do so, we will use the contrast-agnostic model to segment the spinal cord and remove all voxels above the highest segmented point in the inferior-superior direction.
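A minimal sketch of that cropping step (assumptions: nibabel/numpy, the last axis running inferior→superior, and placeholder file names; the spinal cord segmentation comes from the contrast-agnostic model):

```python
# Hedged sketch: zero out everything above the most superior voxel of the
# spinal cord segmentation. Orientation and file names are assumptions.
import nibabel as nib
import numpy as np

img = nib.load("sub-XXX_T2w.nii.gz")
sc_seg = nib.load("sub-XXX_T2w_sc-seg.nii.gz")

data = np.asanyarray(img.dataobj).copy()
seg = np.asanyarray(sc_seg.dataobj)

# Most superior slice containing spinal cord (last axis assumed I->S)
top_slice = int(np.max(np.nonzero(seg.sum(axis=(0, 1)))[0]))
data[:, :, top_slice + 1:] = 0  # remove everything above the spinal cord

nib.save(nib.Nifti1Image(data, img.affine, img.header), "sub-XXX_T2w_cropped.nii.gz")
```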
To reduce the variability of the AttentionUnet model, I tried doing some test-time-augmentation.
The experiment was done using the script monai/test_model_tta.py.
conda activate venv_monai
CUDA_VISIBLE_DEVICES=1 python ms-lesion-agnostic/monai/test_model_tta.py --config ms-lesion-agnostic/monai/config_test.yml --data-split test
We chose to perform 10 predictions for each subject, which were stored in separate files.
The dice scores were averaged using the following code:
python ms-lesion-agnostic/monai/average_tta_performance.py --pred-dir-path ~/moneta/users/pierrelouis/ms-lesion-agnostic/tta_exp/test_set
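For reference, a hedged sketch of what this averaging step does conceptually (not the exact average_tta_performance.py code; the CSV layout is an assumption):

```python
# Hedged sketch: average the per-run Dice scores of the 10 TTA predictions
# into one value per subject. Column names are assumptions.
import pandas as pd

scores = pd.read_csv("tta_dice_scores.csv")  # assumed columns: subject, tta_run, dice
mean_dice = scores.groupby("subject")["dice"].mean().reset_index()
mean_dice.to_csv("tta_dice_scores_averaged.csv", index=False)
```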
Then we plot the performance of the average dice:
python ms-lesion-agnostic/monai/plot_performance.py --pred-dir-path ~/net/ms-lesion-agnostic/tta_exp/test_set/ --data-json-path ~/net/ms-lesion-agnostic/msd_data/dataset_2024-07-24_seed42_lesionOnly.json --split test
Here is the output of this work with TTA:
Here is a GIF showing the comparison between the performance without TTA and the performance with TTA.
- I am now looking into examples of the algorithm having very different predictions for the same subject
Warning
I just realized I was making a big mistake in the test inference script: I was cropping the image to the size of the kernel ([64, 128, 128] in RPI) before inference. Because of this, the model doesn't see the entire image. I am currently working on fixing this, which means I will also have to recompute the models' performances.
I removed the cropping of the image before doing inference. It is now giving more accurate results. This can be seen in the fact that fewer predictions have a Dice score of 1: that happened when the images were cropped and the cropped image didn't contain any lesion.
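For reference, one common way to let a fixed-kernel model see the whole image is MONAI's sliding-window inference; this is a hedged sketch, not necessarily the exact fix applied in test_model.py (the model and image below are placeholders):

```python
# Hedged sketch: sliding-window inference over the full image with the
# training kernel size; model and image are placeholders.
import torch
from monai.inferers import sliding_window_inference
from monai.networks.nets import AttentionUnet

model = AttentionUnet(spatial_dims=3, in_channels=1, out_channels=1,
                      channels=(64, 128, 256, 512, 1024), strides=(2, 2, 2, 2))
model.eval()
image = torch.zeros(1, 1, 96, 256, 192)  # (batch, channel, D, H, W), placeholder

with torch.no_grad():
    soft_pred = sliding_window_inference(
        inputs=image,
        roi_size=(64, 128, 128),  # kernel size used at training time
        sw_batch_size=4,
        predictor=model,
        overlap=0.5,
        mode="gaussian",
    )
```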
However, removing the cropping also raised the question of which threshold to use on the soft prediction: this is being explored in issue #32
The result is disappointing. I think the strategy I used is problematic: I was creating multiple predictions, thresholding each at 0.5, computing each Dice score, and then averaging the Dice scores.
TODO:
- A strategy which would make more sense would be to create multiple predictions, threshold each at 0.5, sum the predictions, threshold the sum at 0.5, and then calculate the Dice score.
Here is the output without TTA:
The variability is much higher, which shows that TTA does help reduce variability. But the average is higher as well. That confirms my hypothesis: with TTA, the results are lower because the average drops when one out of the 10 inferences fails completely (Dice score close to 0%).
TTA second strategy
In a second attempt to improve the results of TTA, I did the following:
- perform inference 10 times with different data augmentations
- threshold each prediction at 0.5
- sum the predictions
- threshold the summed prediction at 0.5
- compute the Dice score
python ms-lesion-agnostic/monai/compute_performance_tta_sum.py --path-pred ~/net/ms-lesion-agnostic/tta_exp/test_set/ --path-json ~/net/ms-lesion-agnostic/msd_data/dataset_2024-07-24_seed42_lesionOnly.json --split test --output-dir ~/net/ms-lesion-agnostic/tta_exp/perf_output_sum_image
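For reference, a hedged sketch of the aggregation described in the steps above (not the exact compute_performance_tta_sum.py code):

```python
# Hedged sketch: binarize each TTA prediction, sum them, binarize the sum,
# then compute the Dice score against the ground truth.
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice score between two binary masks."""
    intersection = np.sum(pred * gt)
    return float(2 * intersection / (pred.sum() + gt.sum() + eps))

def tta_sum_dice(soft_preds: list, gt: np.ndarray) -> float:
    binarized = [(p > 0.5).astype(np.uint8) for p in soft_preds]  # threshold each prediction
    summed = np.sum(binarized, axis=0)                            # voxel-wise vote count
    fused = (summed > 0.5).astype(np.uint8)                       # threshold the summed prediction
    return dice(fused, gt)
```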
Then to compute the figure, I did the following:
python ms-lesion-agnostic/monai/plot_performance.py --pred-dir-path ~/net/ms-lesion-agnostic/tta_exp/perf_output_sum_image/ --data-json-path ~/net/ms-lesion-agnostic/msd_data/dataset_2024-07-24_seed42_lesionOnly.json --split test
Here is the output:
This second strategy is not as effective as I thought it would be! The results are a bit disappointing as well.
Now I want to evaluate the normal (meaning no TTA) AttentionUnet model on the following metrics: Dice, PPV, Sensitivity and F1 score.
To do so, I modified the code in monai/test_model.py to include the computation of the other metrics.
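A hedged sketch (not the exact test_model.py code) of how these metrics can be computed from voxel-wise TP/FP/FN counts on binary masks:

```python
# Hedged sketch: Dice, PPV, sensitivity and F1 from voxel-wise counts.
import numpy as np

def lesion_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> dict:
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    dice = 2 * tp / (2 * tp + fp + fn + eps)
    ppv = tp / (tp + fp + eps)          # precision
    sensitivity = tp / (tp + fn + eps)  # recall
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity + eps)
    return {"dice": dice, "ppv": ppv, "sensitivity": sensitivity, "f1": f1}
```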
I ran it with the following code:
CUDA_VISIBLE_DEVICES=1 python ms-lesion-agnostic/monai/test_model.py --config ms-lesion-agnostic/monai/config_test.yml --data-split test
To plot the performance:
python ms-lesion-agnostic/monai/plot_performance.py --pred-dir-path ~/net/ms-lesion-agnostic/results/2024-07-18_10\:46\:21.634514/test_set/ --data-json-path ~/net/ms-lesion-agnostic/msd_data/dataset_2024-07-24_seed42_lesionOnly.json --split test
Here are the results: