microsoft/MInference

[Question]: Some questions on the code

cyLi-Tiger opened this issue · 4 comments

Describe the issue

Hi, thanks for your great work!

I have some questions about your code:

  1. In search_pattern, why is the search space different from Table 6 in the paper? And is the search space generated based on a fixed FLOPs budget?

  2. In the following code snippet from search_pattern, every best_ty is reassigned to "vertical_and_slash", so only vertical_slash_sparse_attention will ever be called? Besides, for best_ty == "block_sparse", where do the magic numbers 1000 and 6096 come from?

if best_ty == "stream_llm":
    best_ty = "vertical_and_slash"
if best_ty == "block_sparse":
    best_ty, best_v, best_s = "vertical_and_slash", 1000, 6096
  3. In the paper, you mention that one sample is enough to capture the pattern for different prompts with different lengths; is there any evidence supporting that?

> we use only one sample as our validation set from KV retrieval synthetic data with 30k token inputs, which exhibits strong generalization and stability across different lengths and domains.

Besides, I think I might have found a bug when running examples/run_hf_streaming.sh with Qwen2-7B-Instruct. In apply_rotary_embed_single, you need to index with position_ids, since q_len == 1 during the decoding stage. I changed the code to `cos = cos[position_ids]` and `sin = sin[position_ids]`, and it works fine now.
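
For concreteness, here is a minimal sketch of the change; the helper names follow my description above rather than the repo verbatim, and I assume the usual HF shapes (x is [batch, heads, seq, head_dim], cos/sin are [seq, head_dim]):

```python
import torch

def rotate_half(x):
    # Standard RoPE helper: swap and negate the two halves of the last dim.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_embed_single(x, cos, sin, position_ids, unsqueeze_dim=1):
    # Sketch of the fix: gather the cos/sin rows at the current position_ids
    # so the q_len == 1 decode step uses the rotary angles of its own
    # position instead of the first rows of the cached tables.
    cos = cos[position_ids].unsqueeze(unsqueeze_dim)
    sin = sin[position_ids].unsqueeze(unsqueeze_dim)
    return (x * cos) + (rotate_half(x) * sin)
```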

A bit wordy... Looking forward to your reply!


For the 10× speedup in pre-filling on an A100, are there any scripts to reproduce that?

Hi @cyLi-Tiger, thanks for your detailed questions.

  1. Yes, Table 6 provides a search space based on FLOPs, and the code currently uses a fine-tuned version within this space.
  2. The config we have released is the vertical_and_slash-only version, which has relatively good generalization. In our tests, using "stream_llm" and "block_sparse" resulted in performance loss at some lengths and on some tasks, so the searched pattern falls back to "vertical_and_slash", as annotated in the sketch after this list. The values 1000 and 6096 were determined from empirical experiments.
  3. We have indeed conducted experiments to find the optimal sparse attention pattern, covering multiple examples and tasks. The results show that a single example can yield a good sparse attention pattern, but the search needs to be conducted on a sufficiently dynamic task, such as KV retrieval. All our experimental results use a sparse pattern config searched from a single example, as referenced in Offline-Kernel-Aware-Sparse-Pattern-Search.
  4. Thank you for pointing out this issue. It is mainly caused by the transformers version, and we will fix it in the next release. However, I am curious, as run_hf_streaming.py should not hit this code path. Did you use the kv_cache_cpu=True parameter?
  5. You can refer to the End-to-End Benchmark guideline to reproduce the latency results.
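
To connect this back to the snippet you quoted, here is the same fallback annotated with the reasoning from point 2 (reading best_v and best_s as the vertical and slash sizes):

```python
# The fallback quoted in the question, annotated.
if best_ty == "stream_llm":
    # stream_llm lost accuracy at some lengths/tasks in our tests,
    # so the searched head falls back to vertical_and_slash.
    best_ty = "vertical_and_slash"
if best_ty == "block_sparse":
    # block_sparse also showed performance loss; replace it with a
    # vertical_and_slash setting whose sizes (best_v = 1000 verticals,
    # best_s = 6096 slashes) come from empirical experiments rather
    # than a closed-form rule.
    best_ty, best_v, best_s = "vertical_and_slash", 1000, 6096
```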

Thanks again for your questions.

For the 4th point, your code sets kv_cache_cpu=True by default. I think this is because the prompt in this script is relatively long, ~777k tokens. In needle_in_a_haystack, the KV cache is also moved to the CPU for longer prompts, so is this behavior recommended, or are there other reasons?

You're right, and we will fix this in the next version. The hf_streaming version was intended for building a demo video. To perform 1M-token inference on a single A100, we load the KV cache onto the CPU. Based on our experience, for models like LLaMA-3-8B, prompts larger than 200K tokens require this approach.
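
For reference, enabling that offload when patching a model looks roughly like the sketch below; the kv_cache_cpu keyword follows the discussion above, and the exact argument names may differ in the released API:

```python
# Rough sketch of patching a model with KV-cache CPU offload enabled.
# Assumes the MInference patch accepts kv_cache_cpu as a keyword argument,
# as discussed above; check the released code for the exact signature.
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

model_name = "Qwen/Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# kv_cache_cpu=True keeps the KV cache in host memory, which is what lets
# ~1M-token prompts run on a single A100 at the cost of transfer overhead.
minference_patch = MInference("minference", model_name, kv_cache_cpu=True)
model = minference_patch(model)
```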