[Question]: Some questions on the code
cyLi-Tiger opened this issue · 4 comments
Describe the issue
Hi, thanks for your great work!
I have some questions about your code:
- In `search_pattern`, the search space is different from Table 6 in the paper? And is the search space generated based on a fixed FLOPs budget?
- In the following code snippet from `search_pattern`, every `best_ty` is reassigned to `"vertical_and_slash"`, so only `vertical_slash_sparse_attention` will be called? Besides, for `best_ty == "block_sparse"`, where do the magic numbers `1000` and `6096` come from?

```python
if best_ty == "stream_llm":
    best_ty = "vertical_and_slash"
if best_ty == "block_sparse":
    best_ty, best_v, best_s = "vertical_and_slash", 1000, 6096
```
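The override logic in the snippet above can be read as a small dispatch helper (a sketch with a hypothetical function name, not the repo's actual API):

```python
def select_pattern(best_ty: str, best_v: int, best_s: int):
    """Collapse searched patterns to vertical_and_slash (hypothetical sketch).

    Mirrors the snippet from search_pattern: "stream_llm" keeps its searched
    parameters but is remapped, while "block_sparse" is replaced wholesale
    with the empirically chosen (1000, 6096) fallback, so only the
    vertical-and-slash kernel is ultimately invoked.
    """
    if best_ty == "stream_llm":
        best_ty = "vertical_and_slash"
    if best_ty == "block_sparse":
        # 1000 vertical / 6096 slash indices: empirically chosen fallback
        best_ty, best_v, best_s = "vertical_and_slash", 1000, 6096
    return best_ty, best_v, best_s
```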
- In the paper, you mentioned that one sample is enough to capture the pattern for prompts of different lengths; is there any support for that?
  > we use only one sample as our validation set from KV retrieval synthetic data with 30k token inputs, which exhibits strong generalization and stability across different lengths and domains.
Besides, I think I might have found a bug when running `examples/run_hf_streaming.sh` with Qwen2-7B-Instruct. In `apply_rotary_embed_single`, you need to index by `position_ids`, since `q_len == 1` during the decoding stage. I changed the code to

```python
cos = cos[position_ids]
sin = sin[position_ids]
```

and it works now.
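The fix can be sketched end to end as follows. This is an assumption-laden illustration, not the repo's actual implementation: it assumes HF-transformers-style shapes, where `cos`/`sin` are `[max_seq_len, head_dim]` lookup tables, and shows why indexing by `position_ids` matters when `q_len == 1`:

```python
import torch

def apply_rotary_embed_single(x, cos, sin, position_ids):
    """Apply rotary embedding to one tensor (hypothetical sketch).

    Indexing the cos/sin tables by position_ids ensures that during
    decoding (q_len == 1) the rotation for the current absolute
    position is used, rather than the table's first rows.

    Shapes (HF transformers convention):
      x:            [batch, heads, q_len, head_dim]
      cos, sin:     [max_seq_len, head_dim]
      position_ids: [batch, q_len]
    """
    cos = cos[position_ids].unsqueeze(1)  # -> [batch, 1, q_len, head_dim]
    sin = sin[position_ids].unsqueeze(1)
    # rotate_half: split the head dim and swap halves with a sign flip
    x1, x2 = x.chunk(2, dim=-1)
    rotated = torch.cat((-x2, x1), dim=-1)
    return x * cos + rotated * sin
```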
Sorry this got a bit wordy. Looking forward to your reply!
For the 10× pre-filling speedup on an A100, are there any scripts to reproduce that?
Hi @cyLi-Tiger, thanks for your detailed questions.
- Yes, Table 6 provides the search space based on a FLOPs budget, and the code currently uses a fine-tuned version within this space.
- The config we have released is the vertical-and-slash-only version, which generalizes relatively well. In our tests, using "stream_llm" and "block_sparse" caused performance loss for some lengths and tasks. The values 1000 and 6096 were determined empirically.
- We have indeed conducted experiments to find the optimal sparse attention pattern, across multiple examples and tasks. The results show that a single example can yield good pattern search results, provided the search is conducted on a sufficiently dynamic task, such as KV retrieval. All our reported results use a sparse pattern config searched from a single example, as referenced in Offline-Kernel-Aware-Sparse-Pattern-Search.
- Thank you for pointing out this issue. It is mainly caused by the transformers version, and we will fix it in the next release. However, I am curious: run_hf_streaming.py should not hit this logic. Did you use the `kv_cache_cpu=True` parameter?
- You can refer to the End-to-End Benchmark guideline to reproduce the latency results.
Thanks again for your questions.
For the 4th point, your code sets `kv_cache_cpu=True` by default. I think this is because the prompt in this script is relatively long, ~777k tokens. And in needle_in_the_haysack, the KV cache is also moved to the CPU for longer prompts, so is this behavior recommended, or are there other reasons?
You're right, and we will fix this issue in the next version. The `hf_streaming` version was intended for building a demo video. To perform 1M-token inference on a single A100, we loaded the KV cache onto the CPU. In our experience, for models like LLaMA-3-8B, prompts larger than 200K tokens require this approach.
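The CPU-offload behavior described above can be sketched as follows. This is a minimal illustration with hypothetical function names, assuming a list of per-layer `(key, value)` tensor pairs; it is not the repo's actual `kv_cache_cpu` implementation:

```python
import torch

def offload_kv(past_key_values):
    """Move all cached key/value tensors to CPU memory (sketch).

    Keeping the full 200K+-token cache on the CPU is what makes
    1M-token inference fit on a single A100; pinned memory would
    speed up the copies back to the GPU.
    """
    return [(k.to("cpu"), v.to("cpu")) for k, v in past_key_values]

def fetch_layer_kv(past_key_values, layer_idx, device="cuda"):
    """Bring one layer's KV pair back to the compute device just
    before its attention step, so peak GPU memory holds only a
    single layer's cache at a time (sketch)."""
    k, v = past_key_values[layer_idx]
    return k.to(device), v.to(device)
```

The trade-off is extra host-to-device transfer time per layer in exchange for a much smaller GPU memory footprint.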