pals-ttic/adapting-CLIP

Something wrong in Eq. (7) in the manuscript

Closed this issue · 7 comments

Hi,

Thanks for your great work!

May I ask whether there is an error in the first term of Eq. (7) in the manuscript? WV appears twice, and there is no key projection (WK).

Thanks.

Hi @rshaojimmy,

Thanks for the catch. There is a typo in Eq. (7).
The first WV should be WK, since an attention score is computed between a query and a key.

We will update the arXiv version.

Best,
Raymond

Got it. Thanks!

May I further ask what the second term in Eq. (7) is for? The first term is the self-attention conducted within the region r. Why is a second term needed, compared to normal self-attention?

Thanks.

Recall, we defined \mathcal{R} to be a set of patch indices, i.e., it does not contain the region token r(l).
In normal self-attention, each token also attends to itself. Hence, we need the second term, which is the region token's attention to itself.

Side note: We could have defined the set \mathcal{R} to also include the region token, in which case only the first term would remain. However, this would require a single notation for both the patches (f) and the region token (r), which we thought might confuse the reader.
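
To make the equivalence concrete, here is a minimal sketch. It is not the code from this repository and not a transcription of Eq. (7): it assumes a single attention head, the usual 1/sqrt(d) scaling, and made-up toy weights. The names `region_attention_two_terms` and `self_attention_for_query` are hypothetical. It checks that "patches in \mathcal{R}, plus a separate self term" gives the same output as ordinary softmax attention over the patches in \mathcal{R} together with the region token.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    return [dot(row, x) for row in W]

def region_attention_two_terms(r, patches, region, W_q, W_k, W_v):
    """Region token r attends to the patches indexed by `region` (first term)
    and to itself (second term). `region` holds patch indices only."""
    q = matvec(W_q, r)
    scale = math.sqrt(len(q))  # assumption: standard scaled dot-product
    patch_scores = [math.exp(dot(q, matvec(W_k, patches[i])) / scale) for i in region]
    self_score = math.exp(dot(q, matvec(W_k, r)) / scale)
    z = sum(patch_scores) + self_score  # shared softmax normaliser
    first = [0.0] * len(W_v)
    for s, i in zip(patch_scores, region):
        v = matvec(W_v, patches[i])
        first = [a + (s / z) * b for a, b in zip(first, v)]
    second = [(self_score / z) * b for b in matvec(W_v, r)]
    return [a + b for a, b in zip(first, second)]

def self_attention_for_query(query, tokens, W_q, W_k, W_v):
    """Ordinary softmax attention output for one query over a list of tokens."""
    q = matvec(W_q, query)
    scale = math.sqrt(len(q))
    scores = [math.exp(dot(q, matvec(W_k, t)) / scale) for t in tokens]
    z = sum(scores)
    out = [0.0] * len(W_v)
    for s, t in zip(scores, tokens):
        v = matvec(W_v, t)
        out = [a + (s / z) * b for a, b in zip(out, v)]
    return out

# Toy data (all values are illustrative assumptions).
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.5, 0.1], [0.2, 0.3]]
W_v = [[1.0, 2.0], [3.0, 4.0]]
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
region = [0, 2]    # patch indices covered by the region
r = [0.5, -0.5]    # region token

two_terms = region_attention_two_terms(r, patches, region, W_q, W_k, W_v)
one_term = self_attention_for_query(r, [patches[i] for i in region] + [r], W_q, W_k, W_v)
print(two_terms, one_term)
```

The two results agree up to floating-point rounding, which is the sense in which the second term only accounts for the region token not being a member of \mathcal{R}.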

Thanks! But the manuscript does not seem to state explicitly that \mathcal{R} excludes the region token r(l).

The paper states: "\mathcal{R} denotes a set of patch indices covered by the region".
As a region token does not correspond to a patch, it is not included in \mathcal{R}. We can make this more explicit.
Thanks for pointing this out.

I see. Thanks so much.