IBM/ModuleFormer

Context length


Great work, this reads as very interesting.
While fine-tuning, I wondered whether the sequence length can be increased.
For example, with a LoRA MoE approach we can use YaRN and RoPE to extend the context length.

Do you have any thoughts on that, or are you perhaps already training on longer context sizes?

Do you have any information about FlashAttention or similar techniques for memory efficiency on longer sequences?

Hi, the model uses stick-breaking attention, which has no position embedding or bias. You can increase the sequence length as much as you want during fine-tuning, and no code changes are needed. The current model works best with context lengths up to 2048.
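
For reference, here is a minimal sketch of what fine-tuning at a longer context length might look like. The Hub ID `ibm/MoLM-350M-4B` is an assumption (substitute the actual released checkpoint), and it assumes the custom modeling code follows the usual Hugging Face causal-LM interface with `trust_remote_code` and `labels`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ibm/MoLM-350M-4B"  # assumed Hub ID; replace with the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype=torch.bfloat16
)

# Stick-breaking attention has no position embedding, so a longer sequence
# only needs a larger max_length at tokenization time; no model code changes.
batch = tokenizer(
    ["a long training document ..."],
    truncation=True,
    max_length=8192,  # longer than the 2048 used during pretraining
    return_tensors="pt",
)

# Assumes the custom modeling code accepts `labels` like a standard causal LM.
outputs = model(**batch, labels=batch["input_ids"].clone())
outputs.loss.backward()
```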

Hi, thanks for the response.
Do you have any benchmarks on memory consumption?

We didn't benchmark memory consumption. But since the model only activates 2 to 4 heads per token, memory consumption should be better than, or at least comparable to, other models.
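
If you want a rough number for your own setup, a quick peak-memory sweep along these lines could help. This is a sketch only, not an official benchmark; the Hub ID is again an assumption and a CUDA device is required:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ibm/MoLM-350M-4B"  # assumed Hub ID; replace with the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype=torch.bfloat16
).cuda().eval()

# Measure peak GPU memory for a single forward pass at several sequence lengths.
for seq_len in (512, 1024, 2048, 4096):
    torch.cuda.reset_peak_memory_stats()
    ids = torch.randint(0, tokenizer.vocab_size, (1, seq_len), device="cuda")
    with torch.no_grad():
        model(input_ids=ids)
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"seq_len={seq_len}: peak memory {peak_gb:.2f} GiB")
```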