BlinkDL/RWKV-LM

Abnormal values in mixing coefficients of token shift

Triang-jyed-driung opened this issue · 3 comments

I posted this issue on Discord a week ago, but no one has replied yet, and I don't know exactly what is happening.
The point is that some mixing coefficients in token shift are abnormally large.
The RWKV paper says

The token shift or time-shift mixing, or (diagonal arrows in Figure 3), also contributes to the model’s adaptation to sequential data. By linearly interpolating between the current input and the previous time step input, the model naturally aggregates and gates information in the input channels. 

which means that token shift is an interpolation (rather than an extrapolation) between the current token and the previous token, so the mixing coefficients should stay in [0,1]. But some of the coefficients are abnormally large.
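For context, token shift is a per-channel linear interpolation between the current token and the previous one. A minimal sketch in the style of the RWKV-4 code (parameter names like `time_mix_k` follow the repo, but this is a simplified illustration, not the exact implementation):

```python
import torch
import torch.nn as nn

class TokenShiftMix(nn.Module):
    """Per-channel linear interpolation between each token and the previous one."""
    def __init__(self, n_embd: int):
        super().__init__()
        # shift the sequence right by one step along the time axis,
        # zero-padding the first position
        self.time_shift = nn.ZeroPad2d((0, 0, 1, -1))
        # per-channel mixing coefficients, initialized inside [0, 1)
        self.time_mix_k = nn.Parameter(torch.rand(1, 1, n_embd))

    def forward(self, x):  # x: (batch, seq_len, n_embd)
        x_prev = self.time_shift(x)  # x_prev[:, t] == x[:, t - 1]
        # a true interpolation only if time_mix_k stays in [0, 1];
        # nothing in the model constrains it to that range during training
        return x * self.time_mix_k + x_prev * (1 - self.time_mix_k)
```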
This is from the RWKV-4-World-CHNtuned-0.1B model:
[screenshots of time_mix coefficient values from the checkpoint]
Some values are as large as 17 and others reach -17, but in theory they are interpolation weights and should fall in [0,1].
This behavior might eventually lead to gradient explosion, resulting in numerical instability.
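For anyone who wants to reproduce the numbers, the coefficients can be read straight out of the checkpoint; a sketch (the file name is illustrative):

```python
import torch

# load the checkpoint on CPU (file name is illustrative)
state = torch.load("RWKV-4-World-CHNtuned-0.1B.pth", map_location="cpu")

# print the extremes of every token-shift mixing parameter
for name, tensor in state.items():
    if "time_mix" in name:
        t = tensor.float()
        print(f"{name}: min={t.min().item():.2f}, max={t.max().item():.2f}")
```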

Also, I noticed that this token shift trick is not commonly seen in other models such as LSTM or GPT.
Is it another of Bo Peng's inventions?

Hi, yes, TokenShift was invented by me.

Values larger than 1 can work as a "sharpen filter". No, it won't cause numerical instability.

What do you mean by "sharpen filter"? What does that mean for the inputs?
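One plausible reading of the "sharpen filter" remark (an interpretation, not confirmed in this thread): with mixing coefficient mu, the output mu*x_t + (1-mu)*x_{t-1} can be rewritten as x_t + (mu-1)*(x_t - x_{t-1}), so mu > 1 adds a multiple of the token-to-token difference, amplifying changes between consecutive inputs much like unsharp masking amplifies edges in an image. A toy illustration (the values are made up):

```python
import torch

x_prev = torch.tensor([1.0, 1.0, 1.0])  # previous token's channels
x_curr = torch.tensor([1.0, 2.0, 1.1])  # current token's channels

def token_shift_mix(mu, x_curr, x_prev):
    # reduces to x_curr + (mu - 1) * (x_curr - x_prev)
    return mu * x_curr + (1 - mu) * x_prev

print(token_shift_mix(0.5, x_curr, x_prev))  # mu in [0,1] averages: ~[1.00, 1.50, 1.05]
print(token_shift_mix(3.0, x_curr, x_prev))  # mu > 1 amplifies the change: ~[1.00, 4.00, 1.30]
```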