lucidrains/x-transformers

Question: num_memory_tokens > 0 and return_mems = True

Closed this issue · 3 comments

I'm investigating XL-recurrence while preserving num_memory_tokens > 0.
Looking at the code, it appears that mems are prepended to k and v AFTER the memory tokens have already been prepended to the sequence. By memory tokens, I mean those added through num_memory_tokens > 0, NOT attn_num_mem_kv > 0.
The sequence going into attention is:
| mems | memory tokens | data |
Is this correct?
I would have thought the following would be more correct:
| memory tokens | mems | data |
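
To make the ordering concrete, here is a rough shape-level sketch of what I mean (hypothetical sizes, not the actual library code):

```python
# Hypothetical tensors, just to illustrate the concatenation order I'm describing.
import torch

b, d = 1, 512
mems       = torch.randn(b, 128, d)   # cached hiddens from the previous segment (XL recurrence)
mem_tokens = torch.randn(b, 8, d)     # learned tokens from num_memory_tokens > 0
data       = torch.randn(b, 512, d)   # current segment

x = torch.cat((mem_tokens, data), dim = -2)    # memory tokens are prepended to the data first
kv_input = torch.cat((mems, x), dim = -2)      # mems are then prepended when forming keys/values
# resulting key/value order: | mems | memory tokens | data |
```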

Cheers

By the way, I still don't understand the difference between num_memory_tokens > 0 and attn_num_mem_kv > 0. I can see from the code that they are added at different stages: the former early on, the latter inside each attention layer specifically, with every attention layer getting its own mem_k and mem_v. Fundamentally, though, I don't see the difference in what they are trying to achieve.
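
For reference, my understanding of where each one is configured (kwargs as I read them from the README; exact names may differ by version):

```python
from x_transformers import TransformerWrapper, Encoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    num_memory_tokens = 20,        # learned tokens prepended to the input sequence itself
    attn_layers = Encoder(
        dim = 512,
        depth = 6,
        heads = 8,
        attn_num_mem_kv = 16       # learned key/value pairs added inside every attention layer
    )
)
```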

By the way, this point was discussed in #193.

memory tokens can also query, and their representation evolves as it goes through the network. their keys and values change with the context

memory key / values are static

in my mind, they both address similar issues, but memory tokens are more powerful. memory tokens also only really make sense in encoder setups (although i have an improvised interspersed memory tokens scheme for causal in the repo, not sure if it works with XL). you should just use memory key / values, 4 should be enough
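
A minimal sketch of that suggestion, assuming the usual TransformerWrapper / Decoder kwargs (exact arguments may vary by version):

```python
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 512,
    max_mem_len = 512,                 # cap on cached XL memories
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        rel_pos_bias = True,           # relative positions, as in the XL recurrence example
        attn_num_mem_kv = 4            # static learned key/values per attention layer
    )
)

seg1 = torch.randint(0, 20000, (1, 512))
seg2 = torch.randint(0, 20000, (1, 512))

logits1, mems = model(seg1, return_mems = True)               # cache hiddens from segment 1
logits2, _    = model(seg2, mems = mems, return_mems = True)  # attend to them in segment 2
```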