linyq2117/CLIP-ES

cam size

khan1652 opened this issue · 8 comments

To my understanding, I need to match the image size with the attention weight size, as (h, w) and (hw, hw) respectively. But in your code, I cannot find where you change the cam size. It seems like they already match from the start, and I don't understand how.
In my case, the image size is (512, 512) and the attention weight size is (1024, 1024). To match them for multiplication, I tried resizing the image to (32, 32), but that seems to make my model less accurate. Is there a way to solve this?

In addition, why do you append an additional attention weight layer in the cam function? What does this layer mean?

Thanks a lot!

Thanks for your interest!

The multiplication between the cam and the attention weight is a matrix multiplication rather than an element-wise product, so the cam size is not required to match the attention weight size. Specifically, the cam (h, w) is first reshaped to (hw, 1), then multiplied with the attention weight (hw, hw), which finally gives the refined cam (hw, 1), as shown in this line. You don't need to manually resize the image because our code handles it automatically.

Additionally, the attention weight reflects pixel-wise similarity and thus is used to refine the original cam.
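
For clarity, here is a minimal sketch of that matrix multiplication (the shapes and variable names are illustrative, not the repository's actual code):

```python
import torch

# Illustrative refinement step: reshape the cam to a column vector,
# multiply by the attention weight, then reshape back.
h, w = 32, 32                        # cam spatial size, i.e. (H//16, W//16) for ViT-B/16
cam = torch.rand(h, w)               # class activation map
attn = torch.rand(h * w, h * w)      # attention weights (pixel-wise similarity)

cam_flat = cam.reshape(h * w, 1)     # (h, w) -> (hw, 1)
refined = attn @ cam_flat            # (hw, hw) @ (hw, 1) -> (hw, 1)
refined_cam = refined.reshape(h, w)  # back to (h, w)
```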

Thank you so much for the quick reply.
But I still don't understand how, if the cam is (h, w), the attention weight is automatically (hw, hw). In my case, the image is (512, 512), but the attention weight is not (512×512, 512×512) but rather (1024, 1024).

Note that the input image size is not equal to the cam size. For example, if the input image size is (H, W), e.g., (512, 512), the cam size will be (H//16, W//16), e.g., (32, 32). This is because the CLIP model (ViT-B/16) splits the input image into 16×16 patches. Thus, the attention weight size matches the cam size.
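
To make the arithmetic concrete (an illustrative snippet, not from the repository):

```python
# For a (512, 512) input and ViT-B/16 (patch size 16):
H, W = 512, 512
patch_size = 16
h, w = H // patch_size, W // patch_size   # cam size: (32, 32)
num_tokens = h * w                        # 1024 spatial tokens
# attention weight shape: (num_tokens, num_tokens) = (1024, 1024)
```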

Oh I get this now.
Additionally, is attention_weight_last different from the attention weights given by the CLIP encoder? I'm not sure appending it is necessary.
And how did you set the threshold value, e.g., 0.4? I guess it is different for different datasets?

Actually, the attention operation in the last layer is different from that in the previous layers and may lead to a loss of spatial information. We explore this phenomenon in our new AAAI 2024 paper. Nonetheless, the use of attention_weight_last has little influence on the performance because we take the average of the last attention weight and the previous attention weights.
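
As a rough sketch of the averaging, assuming the per-layer attention weights are collected in a list (names and layer counts are illustrative, not the repository's code):

```python
import torch

# Illustrative averaging: the last layer's attention weight is appended
# before the mean, so its contribution is diluted across all layers.
attn_weights = [torch.rand(1024, 1024) for _ in range(11)]  # earlier layers
attention_weight_last = torch.rand(1024, 1024)              # last layer
attn_weights.append(attention_weight_last)

avg_attn = torch.stack(attn_weights).mean(dim=0)            # (1024, 1024)
```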

The threshold is indeed different for different datasets. We conduct ablation studies in our paper (Sections C and D in the Appendix) to analyze the effect of different hyper-parameters. The performance is not sensitive to these thresholds, so you can simply use the default values.

Thank you so much!
I have one last question. I tried applying your code to my own data, where the original matrix has values in [0, 1]. After applying the bounding boxes and multiplying with the attention weights, the refined matrix ends up with very similar values everywhere, for example: [[0.7191, 0.7183, 0.7187, ..., 0.7140, 0.7092, 0.7117], [0.7187, 0.7188, 0.7198, ..., 0.7128, 0.7093, 0.7085], ...]. I'm struggling to figure out why this is happening and thought you might know. Sorry for so many questions! Are the given CLIP parameters for the transformer enough for obtaining the attention weights?

Does the original matrix represent the cam? A direct way to validate the effectiveness of the attention refinement is to visualize the segmentation mask with and without the attention weights. If you still have questions, you may provide more details about your data, which will help me determine whether the method is applicable to it.
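
For example, a quick visual check could look like this (an illustrative sketch with assumed shapes and threshold, not the repository's code):

```python
import torch
import matplotlib.pyplot as plt

# Compare masks obtained by thresholding the raw cam vs. the attention-refined cam.
h, w = 32, 32
cam = torch.rand(h, w)                   # stand-in for your cam, values in [0, 1]
attn = torch.rand(h * w, h * w)          # stand-in for the attention weights

refined = (attn @ cam.reshape(-1, 1)).reshape(h, w)
# Re-normalize to [0, 1] so a common threshold is meaningful; without this,
# the refined values can cluster in a narrow range.
refined = (refined - refined.min()) / (refined.max() - refined.min() + 1e-8)

thr = 0.4  # example threshold; tune per dataset
fig, axes = plt.subplots(1, 2)
axes[0].imshow(cam > thr)
axes[0].set_title("raw cam mask")
axes[1].imshow(refined > thr)
axes[1].set_title("refined cam mask")
plt.show()
```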

Closed for inactivity.