mahmoodlab/HIPT

The reasoning behind the attention heatmap code

clemsgrs opened this issue · 1 comment

Hi, I went thoroughly through the attention heatmap generation code, and there is one thing I have trouble understanding.
I'd love to hear your take on this, as it would help me fill in the part of the picture I'm missing.

To keep it simple, let's focus on the create_patch_heatmaps_indiv function.

patch2 = add_margin(patch.crop((16,16,256,256)), top=0, left=0, bottom=16, right=16, color=(255,255,255))

In the line above, you take the bottom-right (240, 240) crop of the input patch and paste it into the top-left corner of a white (256, 256) image. You then retrieve the attention scores for the original input patch, as well as for patch2.
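For reference, here is a minimal sketch of what I assume an add_margin helper with this signature does (my own PIL-based reconstruction, not the repo's code):

from PIL import Image

def add_margin(img, top, left, bottom, right, color):
    # Paste img onto a larger canvas, padding each side with `color`.
    w, h = img.size
    canvas = Image.new(img.mode, (w + left + right, h + top + bottom), color)
    canvas.paste(img, (left, top))
    return canvas

# With top=left=0 and bottom=right=16, the (240, 240) crop sits in the
# top-left corner and the bottom/right 16-pixel strips are white padding.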
Eventually, you combine both attention scores in the following lines:

new_score256_2 = np.zeros_like(score256_2)
# shift the patch2 scores back by offset_2 so they re-align with the original patch
new_score256_2[offset_2:s, offset_2:s] = score256_2[:(s-offset_2), :(s-offset_2)]
overlay256 = np.ones_like(score256_2)*100   # weight 100 where only score256_1 contributes
overlay256[offset_2:s, offset_2:s] += 100   # weight 200 where both score maps overlap
score256 = (score256_1+new_score256_2)/overlay256

Here, all you do is restrict the attention scores from score256_2 to those corresponding to the tissue crop in patch2.
Then, you sum score256_1 and new_score256_2, making sure to divide the portion where score256_1 and score256_2 overlap by a weight twice as large (200), because both score maps represent the same tissue there.
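To sanity-check the weighting, here is a toy example I put together (the values of s and offset_2 are hypothetical; in the repo they depend on the score-map scale):

import numpy as np

s, offset_2 = 256, 16                  # hypothetical sizes, for illustration only
score256_1 = np.full((s, s), 100.0)    # pretend both score maps are flat
score256_2 = np.full((s, s), 100.0)

new_score256_2 = np.zeros_like(score256_2)
new_score256_2[offset_2:s, offset_2:s] = score256_2[:(s - offset_2), :(s - offset_2)]
overlay256 = np.ones_like(score256_2) * 100
overlay256[offset_2:s, offset_2:s] += 100

score256 = (score256_1 + new_score256_2) / overlay256
print(score256[0, 0])      # 1.0 -> border: 100 / 100 (only score256_1 contributes)
print(score256[128, 128])  # 1.0 -> overlap: (100 + 100) / 200 (both maps contribute)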

I drew a summary of what is happening:

[Figure: attention_patch_summary]

My question then boils down to: what is the reasoning behind blending with a shifted crop, rather than simply computing score256 via:

_, a256 = get_patch_attention_scores(patch, model256, device256=device256)
score256 = get_scores256(a256[:,i,:,:], size=(s,)*2)  # i indexes the attention head
score256 = score256 / 100

Thanks!

Hi @clemsgrs

Great visualization. The main reasoning for blending was to produce smooth attention maps, similar to the heatmap generation code in CLAM, which likewise performs block blending. As you have described and visualized, a small offset is added so that the image is shifted by 16 pixels, with the scores averaged such that the padded portion of the image does not contribute to the heatmap. Ultimately, you get a much smoother heatmap, which may aid not only qualitative analysis but also potential post-hoc zero-shot segmentation and detection applications. Since both ViTs have a patch size of 16 (instead of 8), this patch-overlap strategy may help with the latter application.
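In isolation, the strategy amounts to something like the following numpy sketch (a paraphrase of the idea rather than the exact repo code; it blends two precomputed square score maps):

import numpy as np

def blend_shifted_scores(s1, s2, shift):
    # s1: scores for the original patch; s2: scores for a copy shifted
    # up-left by `shift` pixels and white-padded back to the same size.
    s = s1.shape[0]
    aligned = np.zeros_like(s2)
    aligned[shift:s, shift:s] = s2[:s - shift, :s - shift]  # undo the shift

    weight = np.ones_like(s1)          # one valid sample everywhere...
    weight[shift:s, shift:s] += 1.0    # ...two where the crops overlap
    return (s1 + aligned) / weight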