
Abnormal values in mixing coefficients of token shift #188

Open

Triang-jyed-driung opened this issue Sep 22, 2023 · 3 comments

Comments

@Triang-jyed-driung

I posted this issue on Discord a week ago, but no one has replied yet, and I don't know exactly what is happening.
The point is that some mixing coefficients in token shift are abnormally large.
The RWKV paper says

> The token shift or time-shift mixing, or (diagonal arrows in Figure 3), also contributes to the model's adaptation to sequential data. By linearly interpolating between the current input and the previous time step input, the model naturally aggregates and gates information in the input channels.

which means that token shift is an interpolation (rather than an extrapolation) between the current token and the previous token; therefore, the mixing coefficients should stay in [0,1]. But some of the coefficients are abnormally large.
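For concreteness, here is a minimal sketch of that interpolation in PyTorch (the shapes and the zero-padding of the first time step are assumptions based on the public RWKV-4 code; the actual repo may differ in details):

```python
import torch
import torch.nn.functional as F

def token_shift(x: torch.Tensor, mu: torch.Tensor) -> torch.Tensor:
    """Per-channel mix of each token with its predecessor.

    x:  (batch, seq_len, channels) inputs
    mu: (channels,) learned mixing coefficients
    Returns mu * x_t + (1 - mu) * x_{t-1}, with x_{-1} taken as zero.
    """
    # Shift the sequence right by one step: pad one zero row at the front
    # of the time dimension and drop the last row.
    x_prev = F.pad(x, (0, 0, 1, -1))
    return x * mu + x_prev * (1 - mu)

# With mu in [0, 1] this is a convex combination of the current and the
# previous token; the issue is that the trained coefficients leave that range.
x = torch.randn(1, 4, 8)
mu = torch.rand(8)
print(token_shift(x, mu).shape)  # torch.Size([1, 4, 8])
```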
This is from the RWKV-4-World-CHNtuned-0.1B model:

[Three screenshots of time-mix coefficient values from the checkpoint.]
Some values are as large as 17, while others go to -17, but in theory these are interpolation weights and should fall in [0,1].
This behavior might eventually lead to gradient explosion, resulting in numerical instability.
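For anyone who wants to reproduce the observation, here is a sketch of how the coefficients can be inspected. The checkpoint filename and the `time_mix` substring in the parameter names are assumptions based on the published RWKV-4 releases; adjust them to your copy of the weights.

```python
import torch

# Load the checkpoint on CPU and scan for token-shift mixing parameters.
state_dict = torch.load("RWKV-4-World-CHNtuned-0.1B.pth", map_location="cpu")

for name, param in state_dict.items():
    if "time_mix" in name:
        lo, hi = param.min().item(), param.max().item()
        # Flag coefficients outside the [0,1] interpolation range.
        if lo < 0.0 or hi > 1.0:
            print(f"{name}: min={lo:+.2f} max={hi:+.2f}")
```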

Also, I noticed that this token shift trick is not commonly seen in other models, such as LSTM or GPT.
Is it another of Bo Peng's inventions?

@BlinkDL
Owner
BlinkDL commented Oct 13, 2023

Hi, yes, TokenShift was invented by me.

Values larger than 1 can work as a "sharpen filter". No, it won't cause numerical instability.

@VatsaDev

What do you mean by "sharpen filter"? What does that mean for the inputs?
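One possible reading of the "sharpen filter" remark (an interpretation, not confirmed in the thread): a coefficient outside [0,1] extrapolates along the token-to-token change instead of interpolating, amplifying per-channel differences much like unsharp masking amplifies deviations from a blur. A tiny numeric illustration:

```python
# A channel rises from 1.0 to 1.2 between consecutive tokens.
x_prev, x_curr = 1.0, 1.2

for mu in (0.5, 17.0):
    mixed = mu * x_curr + (1 - mu) * x_prev  # = x_prev + mu * (x_curr - x_prev)
    print(f"mu={mu:>4}: {mixed:.2f}")

# mu= 0.5: 1.10 -> interpolation, stays between the two inputs
# mu=17.0: 4.40 -> extrapolation, the token-to-token change is amplified 17x,
#                  analogous to a sharpening filter boosting local differences
```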

@Sh1n1ma
Sh1n1ma commented Mar 25, 2024
[Screenshot: training loss curve (Snipaste_2024-03-24_20-27-01)]

I suspect I've encountered a similar issue during training, but it requires further investigation. Above is my training loss. (Note: I simply replaced the transformer block in the MAE task with a **VisionRWKV** block.)
