
Abnormal values in mixing coefficients of token shift #188

Open

Triang-jyed-driung opened this issue Sep 22, 2023 · 3 comments

Comments

@Triang-jyed-driung

I posted this issue on Discord a week ago, but no one has replied yet, and I don't know exactly what is happening.
The point is that some mixing coefficients in token shift are abnormally large.
The RWKV paper says

> The token shift or time-shift mixing, or (diagonal arrows in Figure 3), also contributes to the model's adaptation to sequential data. By linearly interpolating between the current input and the previous time step input, the model naturally aggregates and gates information in the input channels.

which means that token shift is an interpolation (rather than an extrapolation) between the current token and the previous token; therefore, the mixing coefficients should stay in [0,1]. But some of the coefficients are abnormally large.
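For concreteness, here is a minimal sketch of that interpolation in PyTorch (the shapes and the zero-padding of the first time step are assumptions based on the public RWKV-4 code; the actual repo may differ in details):

```python
import torch
import torch.nn.functional as F

def token_shift(x: torch.Tensor, mu: torch.Tensor) -> torch.Tensor:
    """Per-channel mix of each token with its predecessor.

    x:  (batch, seq_len, channels) inputs
    mu: (channels,) learned mixing coefficients
    Returns mu * x_t + (1 - mu) * x_{t-1}, with x_{-1} taken as zero.
    """
    # Shift the sequence right by one step: pad one zero row at the front
    # of the time dimension and drop the last row.
    x_prev = F.pad(x, (0, 0, 1, -1))
    return x * mu + x_prev * (1 - mu)

# With mu in [0, 1] this is a convex combination of the current and the
# previous token; the issue is that the trained coefficients leave that range.
x = torch.randn(1, 4, 8)
mu = torch.rand(8)
print(token_shift(x, mu).shape)  # torch.Size([1, 4, 8])
```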
This is from the RWKV-4-World-CHNtuned-0.1B model:

[Three screenshots of time-mix coefficient values from the checkpoint.]
Some values are as large as 17, while others go to -17, but in theory these are interpolation weights and should fall in [0,1].
This behavior might eventually lead to gradient explosion, resulting in numerical instability.
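For anyone who wants to reproduce the observation, here is a sketch of how the coefficients can be inspected. The checkpoint filename and the `time_mix` substring in the parameter names are assumptions based on the published RWKV-4 releases; adjust them to your copy of the weights.

```python
import torch

# Load the checkpoint on CPU and scan for token-shift mixing parameters.
state_dict = torch.load("RWKV-4-World-CHNtuned-0.1B.pth", map_location="cpu")

for name, param in state_dict.items():
    if "time_mix" in name:
        lo, hi = param.min().item(), param.max().item()
        # Flag coefficients outside the [0,1] interpolation range.
        if lo < 0.0 or hi > 1.0:
            print(f"{name}: min={lo:+.2f} max={hi:+.2f}")
```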

Also, I noticed that this token shift trick is not commonly seen in other models, such as LSTM or GPT.
Is it another of Bo Peng's inventions?

@BlinkDL
Owner
BlinkDL commented Oct 13, 2023

Hi, yes, TokenShift was invented by me.

Values larger than 1 can work as a "sharpen filter". No, it won't cause numerical instability.

@VatsaDev

What do you mean by "sharpen filter"? What does that mean for the inputs?
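One possible reading of the "sharpen filter" remark (an interpretation, not confirmed in the thread): a coefficient outside [0,1] extrapolates along the token-to-token change instead of interpolating, amplifying per-channel differences much like unsharp masking amplifies deviations from a blur. A tiny numeric illustration:

```python
# A channel rises from 1.0 to 1.2 between consecutive tokens.
x_prev, x_curr = 1.0, 1.2

for mu in (0.5, 17.0):
    mixed = mu * x_curr + (1 - mu) * x_prev  # = x_prev + mu * (x_curr - x_prev)
    print(f"mu={mu:>4}: {mixed:.2f}")

# mu= 0.5: 1.10 -> interpolation, stays between the two inputs
# mu=17.0: 4.40 -> extrapolation, the token-to-token change is amplified 17x,
#                  analogous to a sharpening filter boosting local differences
```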

@Sh1n1ma
Sh1n1ma commented Mar 25, 2024
[Screenshot: training loss curve (Snipaste_2024-03-24_20-27-01)]

I suspect I've encountered a similar issue during training, but it requires further investigation. Above is my training loss. (Note: I simply replaced the transformer block in the MAE task with a **VisionRWKV** block.)
