mahmoodlab/HIPT

Worse Performance in CAMELYON16 only

bryanwong17 opened this issue · 5 comments

Hi @Richarizardd , I have a question about the poor performance I observed on the CAMELYON16 dataset only. Here are the results of my experiments after following all of the provided settings and using the pretrained models:

  • CAMELYON16
    1 fold: Train: 242 WSIs, Val: 28 WSIs, Test: 129 WSIs
    Mean Test AUC across 10 folds: 0.709 ± 0.024
    Mean Test ACC across 10 folds: 0.764 ± 0.021

  • UCEC
    1 fold: Train: 668 WSIs, Val: 58 WSIs, Test: 238 WSIs
    Mean Test AUC across 10 folds: 0.991 ± 0.004
    Mean Test ACC across 10 folds: 0.958 ± 0.012

  • Our own dataset (Leica colon)
    1 fold: Train: 469 WSIs, Val: 49 WSIs, Test: 223 WSIs
    Mean Test AUC across 10 folds: 0.977 ± 0.04
    Mean Test ACC across 10 folds: 0.941 ± 0.009

Could you perhaps explain why the mean test AUC and ACC on CAMELYON16 aren't as good? Could it be that the pretraining dataset is very different from the training dataset? Or is it because there aren't many training slides? In fact, I trained CLAM on the same dataset and split distribution for CAMELYON16, and it achieved an AUC and ACC of around 0.85. Your insight would be greatly appreciated. Thank you~

Hi @bryanwong17 - thank you for experimenting with HIPT and sharing the results of your work; it is exciting to see that you are getting some positive results. Several thoughts and questions:

  • What is the dataset source + task for UCEC, and task for your Colon dataset?
  • I never tested on C16, but one reason the performance may be low is that C16 is a "needle-in-a-haystack" task, with small and hard-to-detect tumor regions. Have you tried using the interpretability tooling in CLAM to create heatmaps? Another reason the results may not be as good is that C16 compares normal vs. tumor slides (where the tumor slides contain largely normal morphology as well). You could also try augmenting HIPT by: 1) removing the last Transformer (keeping the parameter count small for small datasets), and 2) replacing it with the multi-branch attention head from CLAM-MB (see the supplement for details).
  • Have you also tried CLAM using the ViT-S/16 weights provided? Also, if you are using the DataLoader from CLAM, watch out for the normalization it uses: CLAM assumes ImageNet normalization, but the normalization used for HIPT pretraining was mean = std = (0.5,)*3 (see the sketch after this list).
  • Why UCEC / COADREAD performance is better: HIPT is able to capture heterogeneity and to distinguish undifferentiated vs. poorly differentiated tumors better. An emerging finding I am seeing is that different cancer types have unique inductive biases that should be modeled differently.
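To illustrate the normalization mismatch mentioned above, here is a minimal sketch of the two eval transforms, assuming a standard torchvision pipeline (the transform names are illustrative, not CLAM's or HIPT's actual code):

```python
from torchvision import transforms

# What a stock CLAM-style pipeline typically assumes: ImageNet statistics.
imagenet_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406),
                         std=(0.229, 0.224, 0.225)),
])

# What the HIPT ViT-S/16 weights were pretrained with: mean = std = (0.5,)*3.
hipt_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5,) * 3, std=(0.5,) * 3),
])
```

Feeding ImageNet-normalized patches to the HIPT weights silently shifts the input distribution, which can quietly degrade the extracted features.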

Hi @Richarizardd ,

  • If I'm not mistaken, my friend downloaded the UCEC dataset from the NIH Genomic Data Commons Data Portal. In both the UCEC and colon datasets, the task is to predict whether the tissue is normal or tumorous. Attached are my training, validation, and test distributions, along with their details:

Camelyon16.zip
UCEC.zip
Leica_colon.zip

  • C16 is, as far as I know, a challenging dataset with very small and difficult-to-detect tumor regions, which may explain why it does not perform well. To confirm: are you suggesting that I replace the "region-aggregation" model in the third stage, i.e., the one that makes slide-level predictions from the [M, 192] region features (see the sketch after this list)? Could you also point me to the part of the supplement that describes CLAM-MB?

  • No, I haven't. I also experimented with adding Macenko stain normalization when extracting the features, and the AUC and ACC improved as follows:
    Test AUC: 0.709 ± 0.024 -> 0.75 ± 0.037
    Test ACC: 0.764 ± 0.021 -> 0.768 ± 0.016

  • I see, thanks for the information
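Regarding the suggestion to swap the final slide-level Transformer for a CLAM-MB-style head, here is a minimal sketch of gated, per-class attention pooling over the [M, 192] region features. The class and parameter names are illustrative assumptions, not part of the HIPT or CLAM codebases:

```python
import torch
import torch.nn as nn

class GatedAttentionHead(nn.Module):
    """CLAM-style gated-attention pooling over HIPT region features.

    Takes a bag of M region embeddings [M, 192] and uses one attention
    branch per class (the CLAM-MB "multi-branch" idea) in place of the
    final slide-level Transformer.
    """

    def __init__(self, dim=192, hidden=128, n_classes=2):
        super().__init__()
        self.attn_V = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh())
        self.attn_U = nn.Sequential(nn.Linear(dim, hidden), nn.Sigmoid())
        self.attn_w = nn.Linear(hidden, n_classes)  # one score per class
        self.classifiers = nn.ModuleList(
            [nn.Linear(dim, 1) for _ in range(n_classes)]
        )

    def forward(self, h):                                 # h: [M, 192]
        a = self.attn_w(self.attn_V(h) * self.attn_U(h))  # [M, n_classes]
        a = torch.softmax(a, dim=0)                       # attend over regions
        pooled = a.t() @ h                                # [n_classes, 192]
        logits = torch.stack(
            [clf(pooled[c]).squeeze(-1) for c, clf in enumerate(self.classifiers)]
        )                                                 # [n_classes]
        return logits, a

# usage: logits, attn = GatedAttentionHead()(torch.randn(64, 192))
```

Keeping the head this small follows the rationale above: on a dataset with only ~240 training slides, fewer parameters than a full Transformer may generalize better.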

Simple experiment

Figures (attached): test ACC curves for CAMELYON16 (with Macenko stain normalization), UCEC, and Seegene Leica colon (40x).

My goal is to take the extracted region features from HIPT and train the transformer encoder and attention network with active learning (you can think of this as operating at the slide level now). With an annotation budget of 25 WSIs per generation and 5 generations, training is limited to 125 WSIs in total. Note that the horizontal line represents the test ACC when training with 100% of the data.

According to the figures, the test ACC is already around 0.9 when the model is trained on only 25 WSIs (selected randomly). Would you be able to confirm this? Can active learning be used in this way to improve performance on a small training dataset, given the right way of selecting training data for the next generation/iteration? (A sketch of such a loop follows below.)
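For context, here is a minimal sketch of the acquisition loop described above, under the assumption that slides are picked by predictive entropy; `make_model`, `.fit`, and `.predict_proba` are hypothetical stand-ins, not part of the HIPT codebase:

```python
import torch

def entropy(probs):
    """Predictive entropy of a [n_classes] probability vector."""
    return -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)

def active_learning_loop(make_model, pool, budget=25, generations=5):
    # `pool`: list of (region_features, label) pairs, one entry per WSI.
    labeled, unlabeled = [], list(pool)

    # Generation 0: random seed set, matching the experiment above.
    seed = set(torch.randperm(len(unlabeled))[:budget].tolist())
    labeled += [s for i, s in enumerate(unlabeled) if i in seed]
    unlabeled = [s for i, s in enumerate(unlabeled) if i not in seed]

    for _ in range(generations - 1):
        model = make_model()
        model.fit(labeled)
        # Rank remaining slides by uncertainty; annotate the top `budget`.
        scores = torch.stack(
            [entropy(model.predict_proba(feats)) for feats, _ in unlabeled]
        )
        pick = set(scores.topk(min(budget, len(unlabeled))).indices.tolist())
        labeled += [s for i, s in enumerate(unlabeled) if i in pick]
        unlabeled = [s for i, s in enumerate(unlabeled) if i not in pick]
    return labeled
```

Whether entropy-based selection beats the random baseline here is exactly the open question; other acquisition functions (margin sampling, core-set) would slot into the same loop.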

Hi @bryanwong17 - as commenting further on this issue would require asking for more project specifics, I will close this issue for now and follow up via email later.

Hi @Richarizardd , please find my email address here: bryanwong9095@gmail.com, thank you

If you use the default segmentation parameters of CLAM on the CAMELYON16 dataset, some slides won't yield any tissue regions - for example, normal_027.tif.
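A quick diagnostic in the spirit of CLAM's tissue segmentation (thresholding the HSV saturation channel) can flag such slides before feature extraction; the threshold below mirrors CLAM's default `sthresh`, but the exact defaults and parameter names in your preset may differ:

```python
import cv2
import numpy as np
import openslide

def tissue_fraction(slide_path, sat_thresh=8):
    """Fraction of the lowest-resolution level that passes a saturation threshold."""
    wsi = openslide.OpenSlide(slide_path)
    level = wsi.level_count - 1
    thumb = np.array(
        wsi.read_region((0, 0), level, wsi.level_dimensions[level]).convert("RGB")
    )
    sat = cv2.medianBlur(cv2.cvtColor(thumb, cv2.COLOR_RGB2HSV)[:, :, 1], 7)
    _, mask = cv2.threshold(sat, sat_thresh, 255, cv2.THRESH_BINARY)
    return float((mask > 0).mean())

# A near-zero fraction for a slide like normal_027.tif suggests the
# segmentation parameters (e.g., a lower saturation threshold, or Otsu
# thresholding in the preset) need adjusting for that slide.
```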