microsoft/MInference
[NeurIPS'24 Spotlight] To speed up long-context LLM inference, MInference computes attention with approximate, dynamic sparse patterns, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
Python · MIT license
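For orientation, here is a minimal sketch of how the dynamic sparse attention patch described above is typically applied to a Hugging Face model. The `MInference("minference", model_name)` call, the example model name, and the `long_prompt` placeholder are assumptions based on the project's documented usage pattern, not on this page, and may differ between releases.

```python
# Hedged sketch: patching a Hugging Face model with MInference's dynamic
# sparse attention before a long-context pre-filling pass.
# The MInference(...) signature below is an assumption from README-style
# usage and may not match the current release exactly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference  # assumed import path

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # example long-context model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Patch the attention modules with the "minference" sparse pattern;
# the speedup targets the pre-filling stage.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)

long_prompt = "<a very long document> ... <a question about it>"  # placeholder input
inputs = tokenizer(long_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```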
Issues
[Question]: The evaluation code of scbench does not match the provided dataset.
#103 opened by rainstorm12 - 2
[Question]: 'MInferenceConfig' has no attribute 'get_available_kv_types' and 'get_available_attn_types'
#102 opened by rainstorm12 - 9
[Question]: How to understand dense_decoding?
#94 opened by lemyx - 1
[Question]: How to understand 32000 in patch.py?
#93 opened by lemyx - 2
[Question]: When searching for the best sparse attention type, why does the score calculation pick just 2500 columns?
#92 opened by unicorneeee - 2
[Question]: Code-related question: Is the search performed only on the first batch of the dataset?
#91 opened by unicorneeee - 2
[Question]: vllm-tp generate can't stop
#90 opened by unicorneeee - 3
[Question]: RuntimeError encountered when trying to reproduce results in needle in a haystack
#88 opened by lepangdan - 1
[Question]: How can I reproduce the FullAttention results on the Ruler dataset
#87 opened by LfieLike - 7
[Question]: CUDA error: an illegal memory access was encountered when running benchmark_e2e.py
#86 opened by lepangdan - 3
[Question]: Discrepancy in Pre-filling Time and Memory Consumption on Single A100
#84 opened by lepangdan - 2
[Feature Request]: Is it possible to get the returned logsumexp in streamingllm forward?
#85 opened by 311dada - 2
[Question]: Am I using minference correctly?
#83 opened by YLGH - 2
[Question]: How to think about the vertical lines
#81 opened by YLGH - 2
[Question]: Intuition behind kernel search
#80 opened by PiotrNawrot - 2
[Bug]: Torch not found: can't install with pip install (Python 3.12, CUDA 12.6 Update 1, PyTorch 2.4.1)
#77 opened by atemerev - 1
[Question]: sparsity of minference
#78 opened by susu1210 - 2
[Bug]: loc("Minference/minference/ops/pit_sparse_flash_attention_v2.py":110:23): error: operation scheduled before its operands
#75 opened by leoyuppieqnew - 1
[Question]: Could you provide more examples of other attention usages, e.g., dilated1, streaming, snapkv?
#76 opened by gaow0007 - 1
[Feature Request]: Support LLaVA Model feature request / Low generation speed
#74 opened by ThisisBillhe - 10
[Bug]: vllm executor.driver_worker. 'RayWorkerWrapper' object has no attribute 'model_runner'
#67 opened by TPLink32 - 1
[Question]: What is the speedup of the attention kernel in the current implementation?
#73 opened by foreverpiano - 1
Performance Degradation when Using MInference with Qwen2-7B-Instruct Model
#71 opened by yumingfan-0219 - 2
[Question]: Memory measurement
#68 opened by HayeonLee - 3
[Question]: OOM occurs when reproducing end-to-end latency tests with a single A100
#65 opened by HayeonLee - 2
[Question]: It seems that MInference does not currently support tensor parallelism under vLLM, right? In a multi-GPU environment, the head_id here is incorrect compared to a single GPU
#62 opened by zh2333 - 2
[Question]: AssertionError: The model /workspace/model/llm/Qwen/Qwen2-7B-Instruct you specified is not supported.
#60 opened by LIUKAI0815 - 4
Does MInference support CUDA 11.8?
#56 opened by hensiesp32 - 1
[Bug]: When doing distributed inference with vLLM and MInference, the example fails with an error when tensor_parallel_size is set to a value greater than 1
#63 opened by zh2333 - 3
[Question]: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
#58 opened by IvanDeng0 - 2
[Question]: Errors when I reproduce results in Table 5 (MInference + SnapKV) & poor results with attn_type=minference_with_dense
#55 opened by HayeonLee - 2
[ToDo]: V0.1.6 Iteration Plan
#50 opened by iofu728 - 0
[Question]: Question about the settings of vertical_size and slash_size in vertical_and_slash pattern
#47 opened by ALUKErnel - 1
[Question]: run with local models
#41 opened by qiling1345 - 1
[Question]: Does vertical_slash_sparse_attention support concatenating all batches into a single row for an operation like flash_attn_2_cuda.varlen_fwd?
#46 opened by Amanda-Barbara - 2
[Bug]: NameError: name 'cache_ops' is not defined
#42 opened by Zoro528 - 1
[Question]: Why is running MInference/examples/run_vllm.py not as fast as running vllm alone?
#43 opened by zjjznw123