maximzubkov/fft-scan

Time and space complexity appear to be quadratic in sequence length

Opened this issue · 12 comments

Hi there -- I took a quick look at your code. A key motivation for modeling sequences via linear recurrence relations (instead of, say, self-attention) is that they can be implemented to execute with time and space complexity that are linear in sequence length T, O(T). In your code, it appears that the computation of each matrix Z and its multiplication over sequence length T with each X incur time and space complexity quadratic in sequence length T, O(T^2).
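For context, the linear-time computation being referenced can be sketched as follows. This is an illustrative scalar-state version in PyTorch, not the repo's actual code (the repo operates on batched matrix states):

```python
import torch

def linear_recurrence(a, x):
    """Sequentially compute h_t = a_t * h_{t-1} + x_t (with h_{-1} = 0).

    a, x: 1-D tensors of shape (T,). Runs in O(T) time and O(T) space:
    one multiply-add per step, no T x T intermediate is ever materialized.
    """
    h = torch.empty_like(x)
    prev = torch.zeros(())
    for t in range(x.shape[0]):
        prev = a[t] * prev + x[t]  # fold in one step of the recurrence
        h[t] = prev
    return h
```

The point of the issue is that any implementation which builds all pairwise products (e.g. via a full `torch.bmm` over cumulative matrices) pays O(T^2) instead of this O(T) bound.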

Hello, @fheinsen! Thanks a lot for checking out my work; I missed the fact that torch.bmm is quadratic w.r.t. the sequence length $T$. However, the approach can be slightly modified to avoid bmm operations entirely. I just updated the code and the README with the more efficient approach; the complexity is now $O(T \log T)$.
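The $O(T \log T)$ idea can be sketched as a Hillis-Steele-style inclusive scan over the recurrence's associative operator. This is an illustrative scalar-state sketch under my own naming (`pscan`), not the repo's exact implementation:

```python
import torch

def pscan(a, x):
    """Compute h_t = a_t * h_{t-1} + x_t via a doubling scan.

    Uses the associative operator on pairs
        (a1, x1) o (a2, x2) = (a1 * a2, a2 * x1 + x2),
    combining each position t with position t - offset while the offset
    doubles: O(log T) vectorized passes of O(T) work each, i.e.
    O(T log T) total work, with no T x T tensor materialized.
    """
    a = a.clone()
    x = x.clone()
    T = x.shape[0]
    offset = 1
    while offset < T:
        # RHS is evaluated before assignment, so old values are read.
        x[offset:] = a[offset:] * x[:-offset] + x[offset:]
        a[offset:] = a[offset:] * a[:-offset]
        offset *= 2
    return x
```

Each pass is a pure elementwise operation over slices, so on a GPU the log-many passes parallelize well even though total work is a log factor above the sequential O(T) loop.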

Thank you. I'll take a look as soon as I get a chance -- it might take me a while, because I'm traveling.

Sure! Have a nice trip


Hi, can you try to modify llama2.c?

Hey, sure!

@win10ogod could you please share the link to the repo if you plan to open source it

@maximzubkov This is not my project, but here is the link if you need it: https://github.com/karpathy/llama2.c

@win10ogod Oh, sorry, I misread your first message. Unfortunately, integrating into llama2.c is not a priority for me right now, so I can't promise I'll do it soon. For the moment I'll focus on benchmarking the approach in terms of speed and numerical stability against existing alternatives. But if you have time to integrate pscan into llama2.c yourself, feel free to use this codebase.


I'm still a little confused about deep learning. Also, can I ask you to help implement an idea?

@maximzubkov I believe it is possible to replace conventional KV caching with methods similar to multi-agent reinforcement learning, storing memories dynamically. By using dynamically growing memory weights, training could be split into two parts: training the dynamic weights, and dynamically transitioning them into static "cooled" weights. What are your thoughts on this?
By offloading the static cooled weights to disk or RAM, VRAM usage could be reduced while mitigating the impact on training speed. This should also enhance memory capabilities during inference, since dynamic memory and forgetting would influence which neurons are activated. This is just my idea, as I am a beginner in deep learning.

@win10ogod could you please open another issue with this question?


Hello, I have opened a new issue.