linyq2117/CLIP-ES

cam size

khan1652 opened this issue · 8 comments

To my understanding, I need to match the image size with the attention weight size, as (h, w) and (hw, hw) respectively. But in your code, I cannot find where you change the cam size. It seems like they already match from the start, and I don't understand how.
In my case, the image size is (512, 512) and the attention weight size is (1024, 1024). To match them for multiplication, I tried resizing the image to (32, 32), but that seems to make my model less accurate. Is there a way to solve this?

In addition, why do you append an additional attention weight layer in the cam function? What does this layer mean?

Thanks a lot!

Thanks for your interest!

The multiplication between the cam and the attention weight is a matrix multiplication rather than an element-wise product, so the cam size is not required to match the attention weight size. Specifically, the cam (h, w) is first reshaped to (hw, 1), then multiplied with the attention weight (hw, hw), which finally gives the refined cam (hw, 1), as shown in this line. You don't need to manually resize the image because our code handles it automatically.

Additionally, the attention weight reflects pixel-wise similarity and thus is used to refine the original cam.
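
For clarity, here is a minimal sketch of that matrix multiplication (the shapes and variable names are illustrative, not the repository's actual code):

```python
import torch

# Illustrative refinement step: reshape the cam to a column vector,
# multiply by the attention weight, then reshape back.
h, w = 32, 32                        # cam spatial size, i.e. (H//16, W//16) for ViT-B/16
cam = torch.rand(h, w)               # class activation map
attn = torch.rand(h * w, h * w)      # attention weights (pixel-wise similarity)

cam_flat = cam.reshape(h * w, 1)     # (h, w) -> (hw, 1)
refined = attn @ cam_flat            # (hw, hw) @ (hw, 1) -> (hw, 1)
refined_cam = refined.reshape(h, w)  # back to (h, w)
```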

Thank you so much for the quick reply.
But I still don't understand how, if the cam is (h, w), the attention weight is automatically (hw, hw). In my case, the image is (512, 512), but the attention weight is not (512×512, 512×512) but rather (1024, 1024).

Note that the input image size is not equal to the cam size. For example, if the input image size is (H, W), e.g., (512, 512), the cam size will be (H//16, W//16), e.g., (32, 32). This is because the CLIP model (ViT-B/16) splits the input image into 16×16 patches. Thus, the attention weight size matches the cam size.
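
To make the arithmetic concrete (an illustrative snippet, not from the repository):

```python
# For a (512, 512) input and ViT-B/16 (patch size 16):
H, W = 512, 512
patch_size = 16
h, w = H // patch_size, W // patch_size   # cam size: (32, 32)
num_tokens = h * w                        # 1024 spatial tokens
# attention weight shape: (num_tokens, num_tokens) = (1024, 1024)
```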

Oh I get this now.
Additionally, is attention_weight_last different from the attention weights given by the CLIP encoder? I'm not sure appending it is necessary.
And how did you set the threshold value, e.g., 0.4? I guess it is different for different datasets?

Actually, the attention operation in the last layer is different from that in the previous layers and may lead to a loss of spatial information. We explore this phenomenon in our new AAAI 2024 paper. Nonetheless, the use of attention_weight_last has little influence on the performance because we take the average of the last attention weight and the previous attention weights.
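
As a rough sketch of the averaging, assuming the per-layer attention weights are collected in a list (names and layer counts are illustrative, not the repository's code):

```python
import torch

# Illustrative averaging: the last layer's attention weight is appended
# before the mean, so its contribution is diluted across all layers.
attn_weights = [torch.rand(1024, 1024) for _ in range(11)]  # earlier layers
attention_weight_last = torch.rand(1024, 1024)              # last layer
attn_weights.append(attention_weight_last)

avg_attn = torch.stack(attn_weights).mean(dim=0)            # (1024, 1024)
```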

The threshold is indeed different for different datasets. We conduct ablation studies in our paper (Sections C and D in the Appendix) to analyze the effect of different hyper-parameters. The performance is not sensitive to these thresholds, so you can simply use the default values.

Thank you so much!
I have one last question. I tried applying your code to my own data, where the original matrix has values in [0, 1]. After applying the bounding boxes and multiplying with the attention weights, the refined matrix ends up with very similar values everywhere, for example: [[0.7191, 0.7183, 0.7187, ..., 0.7140, 0.7092, 0.7117], [0.7187, 0.7188, 0.7198, ..., 0.7128, 0.7093, 0.7085], ...]. I'm struggling to figure out why this is happening and thought you might know. Sorry for so many questions! Are the given CLIP parameters for the transformer enough for obtaining the attention weights?

Does the original matrix represent the cam? A direct way to validate the effectiveness of the attention refinement is to visualize the segmentation mask with and without the attention weights. If you still have questions, you may provide more details about your data, which will help me determine whether the method is applicable to it.
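
For example, a quick visual check could look like this (an illustrative sketch with assumed shapes and threshold, not the repository's code):

```python
import torch
import matplotlib.pyplot as plt

# Compare masks obtained by thresholding the raw cam vs. the attention-refined cam.
h, w = 32, 32
cam = torch.rand(h, w)                   # stand-in for your cam, values in [0, 1]
attn = torch.rand(h * w, h * w)          # stand-in for the attention weights

refined = (attn @ cam.reshape(-1, 1)).reshape(h, w)
# Re-normalize to [0, 1] so a common threshold is meaningful; without this,
# the refined values can cluster in a narrow range.
refined = (refined - refined.min()) / (refined.max() - refined.min() + 1e-8)

thr = 0.4  # example threshold; tune per dataset
fig, axes = plt.subplots(1, 2)
axes[0].imshow(cam > thr)
axes[0].set_title("raw cam mask")
axes[1].imshow(refined > thr)
axes[1].set_title("refined cam mask")
plt.show()
```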

Closed for inactivity.