wwyi1828/CluSiam

Clarification on Patch-wise Label Assignment


Hello,

I am seeking clarification on how the labels for patches are assigned in the project. I apologize if I have overlooked any relevant information regarding this.

Is it correct to assume that every patch takes the label of the Whole Slide Image (WSI) it emerged from?

Thank you for your assistance!

Kind regards

The patch labels are not simply inherited from the WSI labels. Instead, each individual patch is assigned a label based on the provided annotations, such as those in the official XML files for the Camelyon16 dataset.

These annotations delineate the specific regions within each WSI that contain tumor tissue. By cross-referencing the coordinates of the extracted patches with these annotated regions, each patch is labeled according to whether it overlaps an annotated area.
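The cross-referencing described above can be sketched roughly as follows. This is not the repository's actual code: it assumes the ASAP-style XML layout used by the Camelyon16 annotation files (`Annotation` elements with `Name` and `PartOfGroup` attributes and ordered `Coordinate` children), and it assumes a patch is labeled by testing its centre point, whereas the real pipeline may use an overlap criterion instead. The helper names `load_polygons`, `point_in_polygon`, and `label_patch`, as well as the example XML, are hypothetical.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written example in the assumed ASAP-style layout (not real data).
EXAMPLE_XML = """<ASAP_Annotations><Annotations>
<Annotation Name="Annotation 0" Type="Polygon" PartOfGroup="_0" Color="#F4FA58">
<Coordinates>
<Coordinate Order="0" X="0" Y="0"/>
<Coordinate Order="1" X="1000" Y="0"/>
<Coordinate Order="2" X="1000" Y="1000"/>
<Coordinate Order="3" X="0" Y="1000"/>
</Coordinates></Annotation>
</Annotations></ASAP_Annotations>"""


def load_polygons(xml_text):
    """Parse the XML into a list of (name, group, [(x, y), ...]) tuples."""
    root = ET.fromstring(xml_text)
    polygons = []
    for ann in root.iter("Annotation"):
        coords = sorted(ann.iter("Coordinate"), key=lambda c: int(c.get("Order")))
        points = [(float(c.get("X")), float(c.get("Y"))) for c in coords]
        polygons.append((ann.get("Name"), ann.get("PartOfGroup"), points))
    return polygons


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def label_patch(x, y, size, polygons):
    """Return the name of the first annotation containing the patch centre."""
    cx, cy = x + size / 2, y + size / 2
    for name, _group, points in polygons:
        if point_in_polygon(cx, cy, points):
            return name
    return "NotAnnotated"


polygons = load_polygons(EXAMPLE_XML)
print(label_patch(100, 100, 256, polygons))
print(label_patch(2000, 2000, 256, polygons))
```

Patch coordinates here are taken to be level-0 pixel coordinates of the top-left corner, matching the coordinate system of the annotation file; if patches are extracted at a lower magnification, the coordinates would need to be rescaled first.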

Thank you very much for your clarification. This was very helpful!

Hello,

Thank you for providing the preprocessed patches—it has been very helpful! However, I have a follow-up question, and would appreciate your clarification on a few points:

Preprocessed Patches Count:
I downloaded the folder containing the preprocessed patches from your repository, which currently holds approximately 2.6 million patches, matching the count stated for the training set. Could you please confirm whether this is expected or if the patches should include the test set and something might have gone wrong during my download process?

Annotation Groups in Tumor Slides:
I’m encountering some confusion regarding the annotation groups for certain tumor slides. For example, the slide tumor_095 has 30 annotations according to its XML file. These are grouped into _0 and _2 under the PartOfGroup attribute. Although I haven’t found an official explanation, it seems that groups _0 and _1 correspond to tumorous regions, while _2 relates to normal tissue. One example of this interpretation can be found here. If this interpretation is correct, annotations 0, 21-23, and 27-30 should be classified as tumorous.
In the preprocessed patches folder for tumor_095, I see four subfolders: NotAnnotated, Annotation 0, Annotation 22, and Annotation 23. Could you give some insight into how these subfolders were generated?

I appreciate any information you can provide on these points.
Thank you for your time!

Thank you for your question and for bringing these points to my attention.

  1. Preprocessed Patches Count: You are correct; the current folder in the repository only contains the preprocessed patches for the training set, which amounts to approximately 2.6 million patches. I apologize for any confusion this may have caused. Due to the large size, uploading is slow and prone to interruptions. My previous attempt to upload the test set patches failed, and I haven't yet tried to re-upload them. However, I will attempt to upload the test set for you to download if needed.

  2. Annotation Groups in Tumor Slides: Regarding the annotation groups for tumor slides, particularly in the case of tumor_095, I understand your confusion. The presence of smaller polygons within a larger polygon is indeed a special case. I have not found an official explanation, but the cells within these smaller polygons look morphologically closer to normal cells. Based on this observation, I believe the PartOfGroup value of these smaller polygons in tumor_095 marks them as "holes" within the larger polygon, representing normal regions surrounded by tumor cells. As for the annotated regions in other WSIs, I believe they are all tumors; slide tumor_095 should be an exception. I have just uploaded the thumbnail images I currently have, visualizing the argmax of the clustering pretext task for all training slides. In these images, green lines depict all the annotation polygons. You can examine the areas outlined by the green lines to gain a better intuition of this particular case. (https://www.dropbox.com/scl/fo/6rqm9s80lickrx8ga08ao/AEnVr9T8j0PFQ6ybY_GT7YU?rlkey=mdlunmnrnn7ccn8zpib0445k0&st=acfgdljw&dl=0)

Hello @wwyi1828,

Thank you for your quick and detailed response! I now have a much clearer understanding of the distinction between the annotation groups, especially regarding slide tumor_095 and its special case with the "holes" representing normal cells surrounded by tumor cells. Just to confirm: the three directories—Annotation 0, Annotation 22, and Annotation 23—should contain all the tumorous patches, correct?
I was able to extract patches from the train and test sets based on the DSMIL code. However, I noticed some discrepancies compared to your training set, so it would be great if you could upload your test set as well.

In case others come across this and find it useful, I used the parameters --magnifications [1] and --background_t 25, which extracted 2,703,006 patches, of which 329,882 are tumorous. By comparison, CluSiam's training set contains 2,616,974 patches, of which 108,595 are annotated and therefore tumorous.
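For anyone comparing the two extractions, the counts quoted above can be turned into tumor fractions with a few lines of Python. The function name `tumor_fraction` is just illustrative; the numbers are the ones reported in this thread.

```python
def tumor_fraction(tumor_patches, total_patches):
    """Fraction of extracted patches that are labeled as tumorous."""
    return tumor_patches / total_patches


# (tumorous patches, total patches) as quoted in this thread.
counts = {
    "DSMIL extraction": (329_882, 2_703_006),
    "CluSiam training set": (108_595, 2_616_974),
}

for name, (tumor, total) in counts.items():
    print(f"{name}: {tumor_fraction(tumor, total):.2%} tumorous")
```

The totals differ by only a few percent, while the tumorous share is roughly 12% in one case and roughly 4% in the other, so the gap probably comes from how patches are labeled rather than from how many patches are extracted.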

On a related note, for the classification task using DSMIL, did you implement a 5-fold cross-validation? I’m curious whether you set the eval_scheme to 5-fold-cv-standalone-test or used the 5-time-train+valid+test scheme.

Thanks again for all your help—I really appreciate it!

Hi,

Polygons that are not contained within other polygons are considered positive annotations. So yes, that is correct, provided those annotations are not located inside other annotated polygons. I will retrieve my hard drive and upload the preprocessed test set this weekend.
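A minimal sketch of that rule, for illustration only: polygons nested inside another polygon are treated as "holes", and a point counts as tumorous if it lies inside an outer polygon but not inside a hole. Containment is approximated by checking that every vertex of one polygon lies inside the other, which is an assumption and may fail for unusual concave shapes. All function names and the toy coordinates are hypothetical and not taken from the repository.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def is_contained(inner, outer):
    """Approximate containment: every vertex of `inner` lies inside `outer`."""
    return all(point_in_polygon(x, y, outer) for x, y in inner)


def split_outer_and_holes(polygons):
    """Split {name: points} into (outer, holes) by nesting."""
    outer, holes = {}, {}
    for name, points in polygons.items():
        nested = any(
            other != name and is_contained(points, other_points)
            for other, other_points in polygons.items()
        )
        (holes if nested else outer)[name] = points
    return outer, holes


def is_tumor(x, y, outer, holes):
    """Positive if inside an outer polygon and not inside any hole."""
    in_outer = any(point_in_polygon(x, y, p) for p in outer.values())
    in_hole = any(point_in_polygon(x, y, p) for p in holes.values())
    return in_outer and not in_hole


# Toy example: one large polygon with a nested hole, plus a separate polygon.
toy = {
    "big": [(0, 0), (1000, 0), (1000, 1000), (0, 1000)],
    "hole": [(400, 400), (600, 400), (600, 600), (400, 600)],
    "separate": [(2000, 2000), (2100, 2000), (2100, 2100), (2000, 2100)],
}
outer, holes = split_outer_and_holes(toy)
print(sorted(outer), sorted(holes))
print(is_tumor(100, 100, outer, holes), is_tumor(500, 500, outer, holes))
```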

I am not familiar with the DSMIL preprocessing pipeline. To check what is happening, I suggest overlaying the extracted patches with colors on the original whole slide images. In the preprocessing pipeline that I used, it did miss a few negative non-background regions (e.g., normal_040). However, tumorous regions should be well-covered. For more details, you can refer to the previous visualization folder, where green/orange colored regions are extracted, and regions without color overlay are the ones that were missed.

https://www.dropbox.com/scl/fi/ppbexw95pjr2xvfc2ct5h/Camelyon_20xpatch_test.zip?rlkey=d20yh2zfndwxkbpzlhkdsrhyi&st=z8zgmasb&dl=0
Here is the link to download the test set. For data splitting, I did not perform cross-validation. Instead, I kept the original test set intact. The validation set is split from the training set.
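As a rough illustration of this kind of split (not the exact procedure used here), the training slides can be divided into training and validation subsets while the official test slides are left untouched. The validation fraction, the random seed, the slide-level granularity, and the function name `split_train_val` are all assumptions made for this sketch.

```python
import random


def split_train_val(slide_ids, val_fraction=0.1, seed=0):
    """Hold out a fraction of the training slides for validation.

    Splitting is done per slide so that patches from one slide never end up
    in both subsets. val_fraction and seed are placeholder values.
    """
    ids = sorted(slide_ids)
    random.Random(seed).shuffle(ids)
    n_val = max(1, round(len(ids) * val_fraction))
    return ids[n_val:], ids[:n_val]


# Hypothetical slide names following the Camelyon16 naming pattern.
training_slides = [f"tumor_{i:03d}" for i in range(1, 21)]
train_ids, val_ids = split_train_val(training_slides, val_fraction=0.2)
print(len(train_ids), len(val_ids))
```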

Hello,

Thank you for your detailed response and for providing the link to the test set! I’ve successfully downloaded it.

Your explanation about the annotation groups helped clear things up, especially around how positive annotations are defined. I also appreciate the clarification on the evaluation process.

I was able to overlay the patches extracted by DSMIL on the slide. The result is mostly consistent, although there are some discrepancies, probably due to a different background threshold.

Thanks again for your explanation and for uploading the test set.