Regarding the implementation of self and cross-attention
xiaopengguo opened this issue · 2 comments
xiaopengguo commented
I'm curious about the insight behind adding the positional embedding to q and k, but not to v, in both self- and cross-attention. Also, is the positional embedding added in every attention block, and if so, why? Looking forward to further insights, and thank you in advance!
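For reference, the pattern I'm asking about looks roughly like the sketch below (illustrative only, not the exact repo code; the `with_pos_embed` helper and `AttentionBlock` names are my own): the positional embedding is added to the query and key inputs of each attention block, while the value is passed through unchanged.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Illustrative attention block: positional embedding is added to q and k only."""
    def __init__(self, d_model=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads)

    @staticmethod
    def with_pos_embed(x, pos):
        # Add the positional embedding if one is provided.
        return x if pos is None else x + pos

    def forward(self, query, key, value, query_pos=None, key_pos=None):
        q = self.with_pos_embed(query, query_pos)  # pos added to q
        k = self.with_pos_embed(key, key_pos)      # pos added to k
        v = value                                  # pos NOT added to v
        out, _ = self.attn(q, k, v)
        return out
```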
franciszzj commented
I'm sorry for the late reply. This is a good question.
Initially, we didn't give it much thought and simply followed the settings in CRIS's code. I believe there is a typo here.
After modifying the code and re-running the experiments, I found that the results with and without adding the positional embedding to the "value" are similar. I hope this is helpful as a reference.
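The variant I compared against is roughly the following: the positional embedding is added to the value as well as to q and k. This is only a sketch under the same illustrative assumptions as the block above, not the exact code we ran.

```python
import torch.nn as nn

def attention_with_pos_on_value(attn: nn.MultiheadAttention,
                                query, key, value,
                                query_pos=None, key_pos=None):
    # Variant used for the comparison: the positional embedding is added
    # to q, k, AND v (instead of q and k only).
    add = lambda x, pos: x if pos is None else x + pos
    out, _ = attn(add(query, query_pos), add(key, key_pos), add(value, key_pos))
    return out
```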
xiaopengguo commented
Thank you for your reply!