alxndrTL/mamba.py

Values of deltaA are very large

anhtienng opened this issue · 5 comments

Hi,

The values of deltaA are very large after discretization:

```python
deltaA = torch.exp(delta.unsqueeze(-1) * A)
```

These large values make the loss go to NaN.
I also found a similar problem reported in the original mamba repo, but I couldn't find a solution there.
I have tried the ZOH discretization to avoid the exp function, but the problem still exists.
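
For what it's worth, one way to see where it blows up is to inspect the argument of the exp before applying it; float32 exp overflows to inf once its argument exceeds roughly 88.7. A hypothetical debugging snippet (not part of the repo, shapes assumed to match the line quoted above):

```python
import torch

def check_discretization(delta, A):
    """Hypothetical debugging helper: inspect the argument of the exp in
    deltaA = exp(delta.unsqueeze(-1) * A) before it overflows."""
    arg = delta.unsqueeze(-1) * A  # (B, L, ED, 1) * (ED, N) -> (B, L, ED, N)
    print("delta range:", delta.min().item(), delta.max().item())
    print("exp argument max:", arg.max().item())
    # float32 exp overflows to inf once its argument exceeds ~88.7
    if arg.max() > 80:
        print("warning: exp(delta * A) is about to overflow -> inf/NaN downstream")
    return torch.exp(arg)
```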

Do you know how to solve it?
Thank you.

The authors of Jamba (a hybrid of Mamba & attention) apply inner layer norms to dt (as well as to B and C).
I've implemented this in the mamba.py file:
https://github.com/alxndrTL/mamba.py/blob/eddec5da76da6594850ea86a7afa56c9ab6b5ac7/mambapy/mamba.py#L246C8-L246C58
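
Roughly, the idea is to normalize dt, B and C right after the projection, before the selective scan. A minimal sketch (illustrative names and shapes; the actual code at the link may differ, e.g. it may use an RMSNorm variant):

```python
import torch
import torch.nn as nn

class InnerNorms(nn.Module):
    """Sketch of Jamba-style inner norms applied to dt, B and C
    before the selective scan. Names/shapes are illustrative."""
    def __init__(self, dt_rank, d_state, eps=1e-5):
        super().__init__()
        self.dt_layernorm = nn.LayerNorm(dt_rank, eps=eps)
        self.B_layernorm = nn.LayerNorm(d_state, eps=eps)
        self.C_layernorm = nn.LayerNorm(d_state, eps=eps)

    def forward(self, dt, B, C):
        # dt: (batch, seq_len, dt_rank), B and C: (batch, seq_len, d_state)
        return self.dt_layernorm(dt), self.B_layernorm(B), self.C_layernorm(C)
```

The norms keep the dt/B/C activations in a bounded range before dt goes through softplus and the discretization.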

Maybe this will help?

The layer norm is not applied to A in the current code.

So you mean I could try to apply it to A?

No, but it is applied to delta, which is used to compute deltaA, the quantity that is blowing up in your case; that's why I proposed this.

I found the problem: I forgot to apply softplus to delta after the projection.
My bad.
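
For anyone hitting the same thing, a minimal sketch of the difference (illustrative shapes and values, not the actual model code):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: delta (B, L, ED), A (ED, N). In Mamba, A is kept negative (A = -exp(A_log)).
delta_raw = torch.full((1, 4, 8), -40.0)   # a raw dt projection output that happens to be negative
A = -torch.exp(torch.ones(8, 2))           # A ≈ -2.72

# Without softplus, delta * A can be large and positive, and exp() overflows to inf,
# which then turns the loss into NaN.
deltaA_bad = torch.exp(delta_raw.unsqueeze(-1) * A)
print(deltaA_bad.max())                    # inf

# With softplus, delta > 0 and A < 0, so delta * A <= 0 and deltaA stays in (0, 1].
delta = F.softplus(delta_raw)
deltaA = torch.exp(delta.unsqueeze(-1) * A)
print(deltaA.max())                        # <= 1.0
```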

Thank you very much.

Cool, it worked out!