Question: attn_head_scale with use_scalenorm
pfeatherstone opened this issue · 1 comment
pfeatherstone commented
Am I right in thinking that using use_scalenorm == True together with attn_head_scale == True is pointless? Since ScaleNorm divides its input by that input's norm, it would undo a learned scalar multiplicative parameter like the one attn_head_scale applies.
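For reference, here is roughly what I mean by ScaleNorm, as a minimal sketch following the "Transformers without Tears" paper (not necessarily the exact implementation in this repo):

```python
import torch
from torch import nn

class ScaleNorm(nn.Module):
    # ScaleNorm(x) = g * x / ||x||_2 : a single learned scalar g in place of
    # LayerNorm's per-feature gain and bias ("Transformers without Tears")
    def __init__(self, dim, eps = 1e-5):
        super().__init__()
        self.g = nn.Parameter(torch.tensor(dim ** 0.5))  # paper inits g = sqrt(dim)
        self.eps = eps

    def forward(self, x):
        norm = x.norm(dim = -1, keepdim = True).clamp(min = self.eps)
        return self.g * x / norm
```

For any positive global scalar c, ScaleNorm(c * x) == ScaleNorm(x), which is the cancellation I am referring to.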
lucidrains commented
yea, the attention head scaling actually came from the NormFormer paper. it is applied to each output head of attention, before the heads are merged and linearly combined by the output projection, so it is not a single global scalar that a downstream ScaleNorm would simply cancel
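roughly, the head scaling looks like this (a minimal illustrative sketch in plain PyTorch, not the exact x-transformers code; shapes and names are mine):

```python
import torch
from torch import nn

class HeadScaledAttention(nn.Module):
    # illustrative only: one learned scalar per head, multiplied into each
    # head's output before the heads are concatenated and projected
    def __init__(self, dim, heads = 8):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.dim_head = dim // heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias = False)
        self.head_scale = nn.Parameter(torch.ones(heads))  # NormFormer-style gamma per head
        self.to_out = nn.Linear(dim, dim, bias = False)

    def forward(self, x):
        b, n, d = x.shape
        h = self.heads
        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
        # (batch, seq, dim) -> (batch, heads, seq, dim_head)
        q, k, v = (t.view(b, n, h, -1).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1) * self.dim_head ** -0.5).softmax(dim = -1)
        out = attn @ v
        # scale each head before merging - because the heads are then mixed by
        # to_out, this is not equivalent to one global scalar on the output
        out = out * self.head_scale.view(1, h, 1, 1)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)
```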
i actually saw instabilities when i last tried it, and nobody else i know is using it, so perhaps i should remove it. these days i favor projecting the original input to a gate per head and gating the attention output that way
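the gating variant i mean, loosely sketched (all names here are hypothetical, not the library's actual API): project the block's original input to one gate per head, then multiply each head's output by the sigmoid of its gate, so the scaling is data-dependent rather than a fixed learned scalar

```python
import torch
from torch import nn

class HeadGate(nn.Module):
    # hypothetical sketch: per-head, per-token gates computed from the
    # original input, applied to the attention output before merging heads
    def __init__(self, dim, heads = 8):
        super().__init__()
        self.to_gates = nn.Linear(dim, heads)

    def forward(self, x, head_outputs):
        # x:            (batch, seq, dim)             original block input
        # head_outputs: (batch, heads, seq, dim_head) per-head attention output
        gates = self.to_gates(x)                     # (batch, seq, heads)
        gates = gates.transpose(1, 2).unsqueeze(-1)  # (batch, heads, seq, 1)
        return head_outputs * gates.sigmoid()
```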