lucidrains/x-transformers

[Question] difference between num_mem_kv and num_memory_tokens

pfeatherstone opened this issue · 7 comments

TransformerWrapper can take num_memory_tokens
AttentionLayers can take num_mem_kv
It seems like these are the same thing, no?
Otherwise, what's the difference?
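
For reference, this is where each knob sits (a sketch based on my reading of the README; kwarg names and routing may differ from the current API):

```python
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    num_memory_tokens = 20,       # learned tokens prepended to the input sequence
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        attn_num_mem_kv = 16      # attn_ prefix routes num_mem_kv = 16 into each attention layer
    )
)

x = torch.randint(0, 20000, (1, 256))
logits = model(x)                 # (1, 256, 20000)
```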

Also, I think I've asked this somewhere before, but does having memory tokens achieve the same thing as having a null key/value? Essentially, attending only to the memory tokens is the same as not attending to the input at all, and therefore the same as attending to a null key/value?
Or am I wrong?
I can't see a use-case for having both memory tokens AND a null kv
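
For clarity, my mental model of the null key/value is just a dedicated slot the softmax can dump its mass into (toy single-head sketch, not the library code; the zero-value initialization is just for illustration):

```python
import torch
from torch import nn

# one learned null key plus a zero value: if the softmax puts all of its mass
# on the null slot, the query effectively attends to nothing
dim = 64
null_k = nn.Parameter(torch.randn(1, 1, dim))
null_v = nn.Parameter(torch.zeros(1, 1, dim))   # zero value, so attending to it adds nothing

q, k, v = (torch.randn(1, 8, dim) for _ in range(3))

k = torch.cat((null_k, k), dim = 1)
v = torch.cat((null_v, v), dim = 1)

attn = (torch.einsum('b i d, b j d -> b i j', q, k) * dim ** -0.5).softmax(dim = -1)
out = torch.einsum('b i j, b j d -> b i d', attn, v)   # (1, 8, 64)
```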

@pfeatherstone it is slightly different

the memory tokens attend (as well as get attended to) and provide a dedicated information lane, as was explored here

the num_mem_kv are just extra key / values that get attended to, so you can view them as feedforward weights within the attention layer
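
roughly, the two mechanisms look like this (simplified single-head sketch, not the actual x-transformers code):

```python
import torch
from torch import nn, einsum

class SimplifiedAttention(nn.Module):
    # single head, no masking - an illustration, not the x-transformers implementation
    def __init__(self, dim, num_mem_kv = 0):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3, bias = False)
        # num_mem_kv: learned key / value rows that all queries can attend to,
        # but which never issue queries themselves (like feedforward weights)
        self.mem_k = nn.Parameter(torch.randn(num_mem_kv, dim))
        self.mem_v = nn.Parameter(torch.randn(num_mem_kv, dim))

    def forward(self, x):
        b = x.shape[0]
        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
        k = torch.cat((self.mem_k.expand(b, -1, -1), k), dim = 1)
        v = torch.cat((self.mem_v.expand(b, -1, -1), v), dim = 1)
        attn = (einsum('b i d, b j d -> b i j', q, k) * self.scale).softmax(dim = -1)
        return einsum('b i j, b j d -> b i d', attn, v)

# memory tokens, in contrast, are prepended to the sequence itself,
# so they both attend and get attended to in every layer
dim, num_memory_tokens = 512, 4
memory_tokens = nn.Parameter(torch.randn(num_memory_tokens, dim))

x = torch.randn(2, 128, dim)
x = torch.cat((memory_tokens.expand(2, -1, -1), x), dim = 1)   # (2, 4 + 128, 512)
out = SimplifiedAttention(dim, num_mem_kv = 16)(x)             # (2, 132, 512)
```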

whether they achieve the same thing will need more follow-up papers. i would be curious to see if the vit pathological issues can be alleviated with null / memory key / values alone, or with other gating techniques for allowing attention to nothing

I can't see a use-case for having both memory tokens AND a null kv

yes, going forward, i'm choosing one or the other for newer attention models depending on what fits better. if you go with memory tokens, you will not need the null key / values, but that's not true the other way around

in other words, if you care about outliers, i think there's a variety of ways to alleviate them, including null kv and memory tokens. but if you care about attention map interpretability, memory tokens are the proven approach atm

In this paper https://arxiv.org/pdf/2309.16588.pdf they propose adding learnable registers. It looks like those are the same as memory tokens. Maybe. Looks like these are becoming more and more important.

@lucidrains The reason you need a null key-value is because of the softmax layer. Do you know if anyone has ever tried replacing softmax with sigmoid? Then, even without a null key-value or memory tokens, the attention map can be zero.
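
Roughly what I mean (toy single-head sketch, no claim this is how it should be done in practice):

```python
import torch

def softmax_attention(q, k, v):
    # rows of the attention map always sum to 1, so every query must attend to
    # something - hence the need for a null key / value to attend to "nothing"
    sim = torch.einsum('b i d, b j d -> b i j', q, k) * q.shape[-1] ** -0.5
    return torch.einsum('b i j, b j d -> b i d', sim.softmax(dim = -1), v)

def sigmoid_attention(q, k, v):
    # each weight is squashed to [0, 1] independently, so a query can assign
    # near-zero weight to every position without any null key / value
    sim = torch.einsum('b i d, b j d -> b i j', q, k) * q.shape[-1] ** -0.5
    return torch.einsum('b i j, b j d -> b i d', sim.sigmoid(), v)

q, k, v = (torch.randn(1, 8, 64) for _ in range(3))
print(softmax_attention(q, k, v).shape, sigmoid_attention(q, k, v).shape)
```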

@pfeatherstone yea, people have tried relu, although it underperforms softmax

sigmoid in particular was tried in a different context

you are right, both would solve the issue of outputting a 0 attention map. the 'vit needs register tokens' paper should try some of these techniques and see if they similarly resolve the artifacts, without the need for full memory tokens.