maximzubkov/fft-scan

Time and space complexity appear to be quadratic in sequence length

Opened this issue · 12 comments

Hi there -- I took a quick look at your code. A key motivation for modeling sequences via linear recurrence relations (instead of, say, self-attention) is that they can be implemented to execute with time and space complexity that are linear in sequence length T, O(T). In your code, it appears that the computation of each matrix Z and its multiplication over sequence length T with each X incur time and space complexity quadratic in sequence length T, O(T^2).
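For context, the linear-time computation being referenced can be sketched as follows. This is an illustrative scalar-state version in PyTorch, not the repo's actual code (the repo operates on batched matrix states):

```python
import torch

def linear_recurrence(a, x):
    """Sequentially compute h_t = a_t * h_{t-1} + x_t (with h_{-1} = 0).

    a, x: 1-D tensors of shape (T,). Runs in O(T) time and O(T) space:
    one multiply-add per step, no T x T intermediate is ever materialized.
    """
    h = torch.empty_like(x)
    prev = torch.zeros(())
    for t in range(x.shape[0]):
        prev = a[t] * prev + x[t]  # fold in one step of the recurrence
        h[t] = prev
    return h
```

The point of the issue is that any implementation which builds all pairwise products (e.g. via a full `torch.bmm` over cumulative matrices) pays O(T^2) instead of this O(T) bound.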

Hello, @fheinsen! Thanks a lot for checking out my work; I missed the fact that torch.bmm is quadratic w.r.t. the sequence length $T$. However, the approach can be slightly modified to avoid bmm operations entirely. I just updated the code and the README with the more efficient approach; the complexity is now $O(T \log T)$.
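The $O(T \log T)$ idea can be sketched as a Hillis-Steele-style inclusive scan over the recurrence's associative operator. This is an illustrative scalar-state sketch under my own naming (`pscan`), not the repo's exact implementation:

```python
import torch

def pscan(a, x):
    """Compute h_t = a_t * h_{t-1} + x_t via a doubling scan.

    Uses the associative operator on pairs
        (a1, x1) o (a2, x2) = (a1 * a2, a2 * x1 + x2),
    combining each position t with position t - offset while the offset
    doubles: O(log T) vectorized passes of O(T) work each, i.e.
    O(T log T) total work, with no T x T tensor materialized.
    """
    a = a.clone()
    x = x.clone()
    T = x.shape[0]
    offset = 1
    while offset < T:
        # RHS is evaluated before assignment, so old values are read.
        x[offset:] = a[offset:] * x[:-offset] + x[offset:]
        a[offset:] = a[offset:] * a[:-offset]
        offset *= 2
    return x
```

Each pass is a pure elementwise operation over slices, so on a GPU the log-many passes parallelize well even though total work is a log factor above the sequential O(T) loop.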

Thank you. I'll take a look as soon as I get a chance -- it might take me a while, because I'm traveling.

Sure! Have a nice trip


Hi, can you try to modify llama2.c?

Hey, sure!

@win10ogod could you please share the link to the repo if you plan to open source it

@maximzubkov This is not my project, but here is the link if you need it: https://github.com/karpathy/llama2.c

@win10ogod Oh, sorry, I misread your first message. Unfortunately, integrating into llama2.c is not a priority for me right now, so I can't promise I'll do it soon. For the moment I'll focus on benchmarking the approach in terms of speed and numerical stability against existing alternatives. But if you have time to integrate pscan into llama2.c yourself, feel free to use this codebase.


I'm still a little confused about deep learning. Also, can I ask you to help implement an idea?

@maximzubkov I believe it is possible to replace conventional KV caching with methods similar to multi-agent reinforcement learning, storing memories dynamically. By using dynamically growing memory weights, training could be split into two parts: training the dynamic weights, and dynamically transitioning them into static "cooled" weights. What are your thoughts on this?
By offloading the static cooled weights to disk or RAM, VRAM usage could be reduced while mitigating the impact on training speed. This should also enhance memory capabilities during inference, since dynamic memory and forgetting would influence which neurons are activated. This is just my idea, as I am a beginner in deep learning.

@win10ogod could you please open another issue with this question?


Hello, I have opened a new issue.