subtraction in attention sharing mechanism
fkcptlst opened this issue · 0 comments
fkcptlst commented
In the implementation of attention sharing, I noticed there's a stacked temporal attention adapter.
My question is: why do you subtract the input `h` from `modified_hidden_states`? Could you share the rationale behind this design choice? Thanks!
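For context, here is a minimal sketch of the pattern I am asking about, under my assumption that the surrounding block applies its own residual addition (`h + adapter(h)`), so subtracting the input inside the adapter makes the two cancel. All names (`temporal_attention`, `adapter`) are hypothetical placeholders, not the repo's actual code:

```python
import numpy as np

def temporal_attention(h):
    # Placeholder for the stacked temporal attention adapter;
    # a fixed linear map stands in for the real attention computation.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((h.shape[-1], h.shape[-1])) * 0.1
    return h @ W

def adapter(h):
    modified_hidden_states = temporal_attention(h)
    # Subtracting the input h means that when the caller later adds
    # the residual (h + adapter(h)), the result is exactly
    # modified_hidden_states rather than h + modified_hidden_states.
    return modified_hidden_states - h

h = np.ones((2, 4))
out = h + adapter(h)  # residual add performed by the surrounding block
```

If that reading is right, the subtraction is just compensating for an external skip connection; I would like to confirm whether that is the intent.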