lucidrains/x-transformers

Support for NormSoftmax

Closed this issue · 16 comments

Based on this paper: https://openreview.net/pdf?id=4g7nCbpjNwd

Would require editing this line:

https://github.com/lucidrains/x-transformers/blob/aabee05d6bca6d74646156009159c55f8d27d884/x_transformers/attend.py#L278C70-L278C75

And replacing the `* scale` with:

    if self.norm_softmax:
        dots = dots / torch.clamp(dots.std(dim=-1, keepdim=True), min=1e-6)
    else:
        dots *= scale

And then something similar in the other path (flash attention)
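
For reference, a rough standalone sketch of the non-flash path with that branch (a simplified attention function, not the actual `Attend` class):

    import torch

    def norm_softmax_attend(q, k, v, norm_softmax=True, eps=1e-6):
        # q, k, v: (batch, heads, seq_len, dim_head)
        scale = q.shape[-1] ** -0.5

        dots = torch.einsum('b h i d, b h j d -> b h i j', q, k)

        if norm_softmax:
            # divide the logits by their std over the key dimension,
            # instead of the usual 1 / sqrt(dim_head) scaling
            dots = dots / torch.clamp(dots.std(dim=-1, keepdim=True), min=eps)
        else:
            dots = dots * scale

        attn = dots.softmax(dim=-1)
        return torch.einsum('b h i j, b h j d -> b h i d', attn, v)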

@catid oh interesting, reminds me a bit of https://arxiv.org/abs/2005.09561

there will also be a temperature involved
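
something like this maybe (just a sketch; the learnable per-head temperature and where it's applied are my guess, not necessarily the paper's exact formulation):

    import torch
    from torch import nn

    heads = 8
    # hypothetical: learnable per-head temperature applied to the std-normalized logits
    temperature = nn.Parameter(torch.ones(heads, 1, 1))

    dots = torch.randn(2, heads, 16, 16)  # (batch, heads, q_len, k_len) attention logits
    dots = dots / torch.clamp(dots.std(dim=-1, keepdim=True), min=1e-6)
    dots = dots * temperature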

have you tried this? maybe i can run a quick experiment tonight

it won't be compatible with flash attention (the fused kernel never materializes the pre-softmax logits, so there's nothing to take a std over)

NormSoftmax CIFAR-10 benchmark results at epoch=60 using ViT-tiny:
baseline: 77.69%
sqrtd: 76.39%
inf: 77.53%

NormSoftmax CIFAR-10 benchmark results at epoch=300 using ViT-tiny:
baseline: 85.19%
inf: 85.07%

Manages to get about the same result without the extra parameters

@catid well yea, so they claim. cifar-10 is a tiny benchmark too

another engineering obstacle would be handling a masked standard deviation
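
something along these lines for a boolean mask, i think (just a sketch, untested, not what's in the repo):

    import torch

    def masked_std(dots, mask, eps=1e-6):
        # dots: (batch, heads, q_len, k_len) attention logits
        # mask: broadcastable boolean, True = attend, False = masked out
        dots = dots.masked_fill(~mask, 0.)
        num = mask.sum(dim=-1, keepdim=True).clamp(min=1)
        mean = dots.sum(dim=-1, keepdim=True) / num
        var = ((dots - mean).masked_fill(~mask, 0.) ** 2).sum(dim=-1, keepdim=True) / num
        return var.clamp(min=eps).sqrt()

    # then dots = dots / masked_std(dots, mask) in place of the unmasked .std()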

yea, let me run it tonight on enwik8, but if i don't see anything notable on the first or second try, probably will just drop this

@catid i'm thinking of the triangular causal mask for autoregressive text generation (gpt). are you masking out the diagonal?

Yeah I'm just copying your vit_for_small_dataset.py

@catid ohh ok, do you see anything? have you run the experiments yourself? never trust anything a paper says unless you see the curves in front of you 😆

The results I shared above are from my setup

@catid wow! ok, i actually put a lot of weight on results from internet randos

ok, let me try it tonight!

@catid wait, your results show norm softmax to be worse than baseline? is that accuracy?

@catid can you share a wandb report with training curves?

I dunno, I mean the numbers are pretty close and I only ran an N=1 trial, so I'm not sure whether one method produces better accuracy than the other. Also, I don't have wandb integrated into my scripts yet (haven't learned how to use it yet).

ah, looks to be a negative result.