Making Unlimiformer work with decoder models (specifically LLaMA)
StrangeTcy opened this issue · 8 comments
This is related to an earlier issue, and in this one I'd like to get technical.
I've been working on trying to adapt Unlimiformer to work with LLaMA for a while now, and it has come down to two main issues (so far): naming and architecture.
As an example of naming: in Unlimiformer.create_key_value, the original attention is calculated from encoder_attn (unlimiformer/src/unlimiformer.py, line 751 in 5b534d1). I had to replace this with self_attn in UnlimiformerLlama, since that's the only kind of attention LLaMA has.
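For illustration, the kind of switch I mean looks roughly like this (function and parameter names here are made up for the example, not the actual Unlimiformer wrapper API):

```python
# Illustrative only -- not the actual Unlimiformer API.
# The point is just that the module the wrapper hooks differs by architecture.
def attention_to_capture(decoder_layer, is_encoder_decoder: bool):
    if is_encoder_decoder:
        # BART-style decoder layers expose a dedicated cross-attention module.
        return decoder_layer.encoder_attn
    # LLaMA / GPT-style layers only have self-attention.
    return decoder_layer.self_attn
```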
And the issue with the architecture came up when trying to calculate the key:
https://github.com/abertsch72/unlimiformer/blob/5b534d1532246da8ef3fb02bdfff41aa853b12de/src/unlimiformer.py#LL753C30-L753C30
Neither LLaMA nor GPT has an encoder, so there are no encoder_hidden_states, and attention.k_proj can't be called on the None value that encoder_hidden_states defaults to.
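To make the failure concrete, here's a rough paraphrase of that code path (simplified sketch, not the exact create_key_value implementation):

```python
# Simplified paraphrase, not the exact create_key_value code.
def create_key(attention, encoder_hidden_states):
    # Encoder-decoder case (e.g. BART): index keys are projected from the
    # last encoder layer's output.
    return attention.k_proj(encoder_hidden_states)

# Decoder-only case (LLaMA, GPT): there is no encoder, so
# encoder_hidden_states keeps its default of None and
# attention.k_proj(None) fails; some other hidden state has to be used.
```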
There's probably some simple solution to this (some other state that should be used in decoder-only models here), but I haven't figured it out yet, either on my own or by re-reading the original paper.
So, any pointers would be welcome.
Can't add any useful insights other than that I'll be watching this issue closely.
I still feel like Unlimiformer can be the biggest leap in decoder models we've seen to date, and it pains me to see it getting so little attention, snowed under by hundreds of other whitepapers and people missing its potential.
From what I understand (which, admittedly, is not much), Unlimiformer effectively beats approaches like Facebook's MEGABYTE in quality, while not being mutually exclusive with them either.
Aside from "just" long context windows, it should also allow per-character tokenization, which, while it'd cut into the total text size supported significantly, might allow for much better models trained from scratch, and still have longer context size than current models on top.
While I can't add any useful insights on how to add decoder-only model support, I hope I can get more people motivated in helping out by highlighting its potential.
EDIT: This post does kinda go on an unrelated tangent. I do not mean to derail this intended-to-be-technical issue or start any new discussions; I got overexcited seeing work on decoder-only support at last. If this post causes derailment I will delete it.
Oh, you're the guy who opened the original related issue. Hello.
I think people are really impressed by MPT and ALiBi (and probably other things), except ALiBi has to be built in, whereas Unlimiformer can just be used as a wrapper around any model you already have.
I also have to look into Facebook's MEGABYTE, although that's a digression at this point.
Hi @StrangeTcy and @SharkWipf ,
Thank you for your interest in Unlimiformer and for your kind words! (@SharkWipf please don't delete your post!)
The huge attention from the community convinced us that we should implement support for decoder-only models :-D
I haven't started working on it yet, so if any of you guys manage to make progress we would love to get contributions, but I think that the major difference is that decoder-only models would require an index per layer, whereas in encoder-decoder models we could have a single index for the encoder_hidden_states. That's because in encoder-decoder models all decoder layers attend to the last encoder layer's output (so we can share a single index), while in decoder-only models every layer attends only to itself.
In other words, the implementation would require keeping num_layers indexes. However, I do think that we will be able to do the attention trick that we describe in Equation 2 in the paper, which will allow sharing the layer-specific index across all heads. So we could have num_layers indexes, rather than num_layers * num_heads.
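Roughly, the trick rewrites a head's attention score so that the key projection is folded into the query (a restatement of Equation 2; the notation here is approximate rather than copied from the paper):

$$ q\,k^\top = (h_d W_q)(h_e W_k)^\top = \left(h_d W_q W_k^\top\right) h_e^\top $$

where h_d is the decoder hidden state, h_e is a hidden state stored in the index, and W_q, W_k are that head's projection matrices. Because h_e is stored unprojected, a single per-layer index can serve all of that layer's heads.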
What do you think? Does it make sense?
Thanks, I'll have to re-read the paper and think about the whole attention reformulation for a while...
Let this be a placeholder for my actual thoughts...:
> The huge attention from the community convinced us that we should implement support for decoder-only models :-D
That's great news!
Unfortunately, much as I'd love to help (and much as I try), I'm still new to this field and have no experience or education in any of the surrounding fields (aside from unrelated sysadmin/backend dev stuff), so catching/keeping up with everything going on is taking all I have atm. I (think I) understand most of the core concepts and implications involved, but the actual implementation is still beyond me I'm afraid.
Best I can do for now is cheer you on and spread the word.
No worries @SharkWipf
Hey @SharkWipf and @StrangeTcy ,
Unlimiformer now supports Llama and Llama-2 (and all their derivatives)!
Check it out, and let us know if you have any questions!
https://github.com/abertsch72/unlimiformer#august-2023---unlimiformer-now-supports-llama-2-and-all-its-derivatives
Best,
Uri
Hi, could you provide some details on how you added support for decoder-only models such as Llama?