wangf3014/SCLIP

Why directly add the results of q_attn and k_attn instead of normalizing them?


Thank you for your nice work!

I am curious about the combination of q_attn and k_attn. In the paper, the final attention is k_attn + q_attn, as shown in
https://github.com/wangf3014/SCLIP/blob/3b40a88665d9398e20d30fe696c1f954978a5cd0/clip/model.py#L298C28-L298C43.
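For context, my rough reading of that line (a simplified sketch only, ignoring multi-head reshaping and any scaling factors; q and k stand for the projected query/key features):

```python
import torch
import torch.nn.functional as F

def csa_sketch(q, k):
    # q, k: (num_tokens, dim) projected query / key features
    q_attn = F.softmax(q @ q.transpose(-2, -1), dim=-1)  # query-query correlation
    k_attn = F.softmax(k @ k.transpose(-2, -1), dim=-1)  # key-key correlation
    return q_attn + k_attn  # summed directly, without a 1/2 factor
```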

Why is the result not normalized back to the 0-1 range, for example as 1/2 * (k_attn + q_attn)?

Could you give me some help?

Hi, thank you for your question. Since we apply CSA in the last transformer block, its output is ultimately normalized by a softmax layer to produce probabilities. The scale of the attention therefore does not affect the relative ordering of the logits across classes; it only changes the temperature of that softmax, and we already have a separate hyper-parameter, "logit_scale", to control that temperature.
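A minimal numeric sketch of that point (the logit values and logit_scale here are made up; only the ranking/temperature behavior matters):

```python
import torch
import torch.nn.functional as F

# Pretend these are per-class similarity logits produced downstream of the CSA output.
logits = torch.tensor([2.0, 1.0, 0.5])
logit_scale = 100.0  # hypothetical temperature hyper-parameter

# Using attn = q_attn + k_attn versus 1/2 * (q_attn + k_attn) just rescales the logits:
p_sum = F.softmax(logit_scale * logits, dim=-1)
p_avg = F.softmax(logit_scale * (logits / 2), dim=-1)

# The class ranking (and argmax) is identical; only the sharpness differs,
# which is exactly what tuning logit_scale already controls.
assert torch.argmax(p_sum) == torch.argmax(p_avg)
print(p_sum, p_avg)
```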

@wangf3014 Thanks for your reply.