subtraction in attention sharing mechanism
luocfprime opened this issue · 0 comments
luocfprime commented
In the implementation of attention sharing, I noticed there's a stacked temporal attention adapter.
My question is: why is the input h subtracted from modified_hidden_states? Could you share some insight into the rationale behind this design? Thanks!
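
To make the question concrete, here is a minimal sketch of the pattern I am referring to. The class, layer, and argument names below are my own placeholders and assumptions, not the actual implementation:

```python
import torch
import torch.nn as nn


class TemporalAttentionAdapter(nn.Module):
    """Hypothetical sketch of a stacked temporal attention adapter.
    Only meant to illustrate the subtraction step being asked about."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Temporal attention produces modified hidden states from the input h.
        x = self.norm(h)
        modified_hidden_states, _ = self.temporal_attn(x, x, x)
        # The step in question: the input h is subtracted from the
        # attention output before it is passed on.
        return modified_hidden_states - h
```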