Layernorm in Cross attention
turtleman99 opened this issue · 4 comments
I'm wondering why we don't need layernorm for K and V, but need it for Q in cross attention. Is there any paper I can refer to? Thanks a lot.
(see vit-pytorch/vit_pytorch/cross_vit.py, lines 53 to 71 at commit 0ad09c4)
It has been a while, but can you check whether the context being passed in comes from another transformer with a final layernorm?
Oh, this is a good point. My x and context are actually features from pre-trained ViT models. I believe I can remove the layernorm for x and context, right?
@turtleman99 if you are following the same scheme as cross vit, then the code is correct and stays the same. x is layernormed (pre-layernorm configuration + residual), while context is left alone since it is already layernormed by your pretrained ViT model. You would only need to layernorm the context if the context also cross attends to x and is updated, as in the ISAB architecture, but that isn't what the cross vit authors did.
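For anyone landing here later, here is a minimal sketch of the pattern described above: the query stream x gets a pre-norm LayerNorm inside the cross-attention block, while the context is projected straight to keys and values on the assumption that it already arrives layernormed (e.g. it is the output of a pretrained ViT that ends with a final LayerNorm). This is only an illustrative sketch, not the actual cross_vit.py code; the module name SimpleCrossAttention and parameters dim, heads, dim_head are made up for the example.

```python
import torch
from torch import nn

class SimpleCrossAttention(nn.Module):
    # Illustrative sketch (not the cross_vit.py implementation):
    # pre-layernorm is applied to the query input only; the context is
    # assumed to already be layernormed, so keys/values use it as-is.
    def __init__(self, dim, heads=8, dim_head=64):
        super().__init__()
        inner_dim = heads * dim_head
        self.heads = heads
        self.scale = dim_head ** -0.5

        self.norm_q = nn.LayerNorm(dim)                          # pre-norm on the query stream
        self.to_q = nn.Linear(dim, inner_dim, bias=False)
        self.to_kv = nn.Linear(dim, inner_dim * 2, bias=False)   # no norm on the context
        self.to_out = nn.Linear(inner_dim, dim)

    def forward(self, x, context):
        b, n, _, h = *x.shape, self.heads

        q = self.to_q(self.norm_q(x))                            # layernorm only the queries' input
        k, v = self.to_kv(context).chunk(2, dim=-1)

        # split heads: (b, seq, h * d) -> (b, h, seq, d)
        q, k, v = map(
            lambda t: t.reshape(b, -1, h, t.shape[-1] // h).transpose(1, 2),
            (q, k, v),
        )

        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)

        return x + self.to_out(out)                               # residual connection around the block
```

As a quick smoke test, feeding x of shape (1, 16, 256) and context of shape (1, 49, 256) into SimpleCrossAttention(256) should return a tensor with the same shape as x.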
Gotcha. Thank you so much for your explanations! :)