Question: attn_head_scale with use_scalenorm
pfeatherstone opened this issue · 1 comment
pfeatherstone commented
Am I right in thinking that using use_scalenorm == True together with attn_head_scale == True is pointless? Since ScaleNorm divides its input by that input's norm, it would undo a learned scalar multiplicative parameter like the one attn_head_scale applies.
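For reference, here is roughly what I mean by ScaleNorm, as a minimal sketch following the "Transformers without Tears" paper (not necessarily the exact implementation in this repo):

```python
import torch
from torch import nn

class ScaleNorm(nn.Module):
    # ScaleNorm(x) = g * x / ||x||_2 : a single learned scalar g in place of
    # LayerNorm's per-feature gain and bias ("Transformers without Tears")
    def __init__(self, dim, eps = 1e-5):
        super().__init__()
        self.g = nn.Parameter(torch.tensor(dim ** 0.5))  # paper inits g = sqrt(dim)
        self.eps = eps

    def forward(self, x):
        norm = x.norm(dim = -1, keepdim = True).clamp(min = self.eps)
        return self.g * x / norm
```

For any positive global scalar c, ScaleNorm(c * x) == ScaleNorm(x), which is the cancellation I am referring to.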
lucidrains commented
yea, the attention head scaling actually came from the NormFormer paper. it is applied to each output head of attention, before the heads are merged and linearly combined by the output projection, so it is not a single global scalar that a downstream ScaleNorm would simply cancel
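roughly, the head scaling looks like this (a minimal illustrative sketch in plain PyTorch, not the exact x-transformers code; shapes and names are mine):

```python
import torch
from torch import nn

class HeadScaledAttention(nn.Module):
    # illustrative only: one learned scalar per head, multiplied into each
    # head's output before the heads are concatenated and projected
    def __init__(self, dim, heads = 8):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.dim_head = dim // heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias = False)
        self.head_scale = nn.Parameter(torch.ones(heads))  # NormFormer-style gamma per head
        self.to_out = nn.Linear(dim, dim, bias = False)

    def forward(self, x):
        b, n, d = x.shape
        h = self.heads
        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
        # (batch, seq, dim) -> (batch, heads, seq, dim_head)
        q, k, v = (t.view(b, n, h, -1).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1) * self.dim_head ** -0.5).softmax(dim = -1)
        out = attn @ v
        # scale each head before merging - because the heads are then mixed by
        # to_out, this is not equivalent to one global scalar on the output
        out = out * self.head_scale.view(1, h, 1, 1)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)
```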
i actually saw instabilities when i last tried it, and nobody else i know is using it, so perhaps i should remove it. these days i favor projecting the original input to a gate per head and gating the attention output that way
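the gating variant i mean, loosely sketched (all names here are hypothetical, not the library's actual API): project the block's original input to one gate per head, then multiply each head's output by the sigmoid of its gate, so the scaling is data-dependent rather than a fixed learned scalar

```python
import torch
from torch import nn

class HeadGate(nn.Module):
    # hypothetical sketch: per-head, per-token gates computed from the
    # original input, applied to the attention output before merging heads
    def __init__(self, dim, heads = 8):
        super().__init__()
        self.to_gates = nn.Linear(dim, heads)

    def forward(self, x, head_outputs):
        # x:            (batch, seq, dim)             original block input
        # head_outputs: (batch, heads, seq, dim_head) per-head attention output
        gates = self.to_gates(x)                     # (batch, seq, heads)
        gates = gates.transpose(1, 2).unsqueeze(-1)  # (batch, heads, seq, 1)
        return head_outputs * gates.sigmoid()
```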