[Question] difference between num_mem_kv and num_memory_tokens
pfeatherstone opened this issue · 7 comments
`TransformerWrapper` can take `num_memory_tokens`
`AttentionLayers` can take `num_mem_kv`
It seems like these are the same thing, no?
Otherwise, what's the difference?
Also, I think I've asked this somewhere before, but does having memory tokens achieve the same thing as having a null key/value? Essentially, attending only to the memory tokens is the same as not attending to the input at all, and therefore the same as attending to a null key/value?
Or am I wrong?
I can't see a use-case for having both memory tokens AND a null kv
@pfeatherstone it is slightly different
the memory tokens attend (as well as get attended to) and provide a dedicated information lane, as was explored here
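to make the distinction concrete, here is a minimal numpy sketch of the memory-token idea under my own names and shapes (this is not the x-transformers implementation): learned tokens are prepended to the sequence and participate in full self-attention, then dropped from the output.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_memory_tokens(x, mem, w_q, w_k, w_v):
    """Single-head self-attention with learned memory tokens.

    x:   (n, d) input tokens
    mem: (m, d) learned memory tokens, shared across the batch

    The memory tokens both attend and get attended to, which is
    what distinguishes them from mem key/values (attended to only).
    """
    h = np.concatenate([mem, x], axis=0)                      # (m + n, d)
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)   # (m+n, m+n)
    out = attn @ v
    return out[mem.shape[0]:]                                 # drop memory rows
```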
the `num_mem_kv` are just extra key / values that get attended to, so you can view them as feedforward weights within the attention layer
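a minimal numpy sketch of that idea (function names and shapes are my own, not the library's API): the learned key/values are simply concatenated onto the real keys and values before the softmax, so every query can attend to them, but they never attend back.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_mem_kv(q, k, v, mem_k, mem_v):
    """Single-head attention with learned memory key/values.

    q, k, v:        (n, d) projected queries / keys / values
    mem_k, mem_v:   (m, d) learned parameters, shared across positions

    The memory kv only get attended to, like extra rows of
    feedforward weights living inside the attention layer.
    """
    k = np.concatenate([mem_k, k], axis=0)       # (m + n, d)
    v = np.concatenate([mem_v, v], axis=0)       # (m + n, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (n, m + n)
    return softmax(scores, axis=-1) @ v          # (n, d)
```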
whether they achieve the same thing will need more follow-up papers. i would be curious to see if the ViT pathological issues can be alleviated with null / memory key / values alone, or with other gating techniques for allowing attention to nothing
I can't see a use-case for having both memory tokens AND a null kv
yes, going forward, i'm choosing one or the other for newer attention models depending on what fits better. if you go with memory tokens, you will not need the null key / values, but that's not true the other way around
in other words, if you care about outliers, i think there's a variety of ways to alleviate it, including null kv and memory tokens. but if you care about attention map interpretability, memory tokens is the proven approach atm
In this paper https://arxiv.org/pdf/2309.16588.pdf they propose adding learnable registers. It looks like those are the same as memory tokens. Maybe. Looks like these are becoming more and more important.
@lucidrains The reason you need a null key-value is the softmax layer. Do you know if anyone has ever tried replacing softmax with sigmoid? Then, even without a null key-value or memory tokens, the attention map can be zero.
@pfeatherstone yea, people have tried relu, although it underperforms softmax
sigmoid in particular was tried under a different context
you are right, both would solve the issue of outputting a 0 attention map. the 'vit needs register tokens' paper should try some of these techniques and see if they similarly resolve the artifacts, without the need for full memory tokens.
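to illustrate the point about sigmoid attention, a small numpy sketch (my own naming, not any library's API): because each weight is squashed independently rather than normalised across keys, an entire attention row can be near zero, i.e. the query attends to nothing without needing a null kv.

```python
import numpy as np

def sigmoid_attention(q, k, v):
    """Attention with elementwise sigmoid in place of softmax.

    Each weight lies in (0, 1) independently; rows are not forced
    to sum to 1, so a query with low similarity to every key can
    produce an all-near-zero row and effectively attend to nothing.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (n, n)
    attn = 1.0 / (1.0 + np.exp(-scores))     # no normalisation over keys
    return attn @ v
```

with softmax, the same scores would still be forced to sum to 1 per row, so some key always receives mass, which is exactly why the null kv exists.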