microsoft/MInference
[NeurIPS'24 Spotlight] To speed up long-context LLM inference, MInference computes attention with approximate, dynamic sparse patterns, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
Python · MIT license
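For orientation, here is a minimal sketch of how the dynamic sparse attention patch described above is typically applied to a Hugging Face model. The `MInference("minference", model_name)` call, the example model name, and the `long_prompt` placeholder are assumptions based on the project's documented usage pattern, not on this page, and may differ between releases.

```python
# Hedged sketch: patching a Hugging Face model with MInference's dynamic
# sparse attention before a long-context pre-filling pass.
# The MInference(...) signature below is an assumption from README-style
# usage and may not match the current release exactly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference  # assumed import path

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # example long-context model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Patch the attention modules with the "minference" sparse pattern;
# the speedup targets the pre-filling stage.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)

long_prompt = "<a very long document> ... <a question about it>"  # placeholder input
inputs = tokenizer(long_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```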
Issues
[Question]: The evaluation code of scbench does not match the provided dataset.
#103 opened by rainstorm12 - 2
[Question]: 'MInferenceConfig' has no attribute 'get_available_kv_types' and 'get_available_attn_types'
#102 opened by rainstorm12 - 9
[Question]: How to understand dense_decoding?
#94 opened by lemyx - 1
[Question]: How to understand 32000 in patch.py?
#93 opened by lemyx - 2
[Question]: When searching for the best sparse attention type, why does the score calculation pick just 2500 columns?
#92 opened by unicorneeee - 2
[Question]: Code-related question: Is the search performed only on the first batch of the dataset?
#91 opened by unicorneeee - 2
[Question]: vllm-tp generate can't stop
#90 opened by unicorneeee - 3
[Question]: RuntimeError encountered when trying to reproduce results in needle in a haystack
#88 opened by lepangdan - 1
[Question]: How can I reproduce the FullAttention results on the Ruler dataset
#87 opened by LfieLike - 7
[Question]: CUDA error: an illegal memory access was encountered when running benchmark_e2e.py
#86 opened by lepangdan - 3
[Question]: Discrepancy in Pre-filling Time and Memory Consumption on Single A100
#84 opened by lepangdan - 2
[Feature Request]: Is it possible to get the returned logsumexp in streamingllm forward?
#85 opened by 311dada - 2
[Question]: Am I using minference correctly?
#83 opened by YLGH - 2
[Question]: How to think about the vertical lines
#81 opened by YLGH - 2
[Question]: Intuition behind kernel search
#80 opened by PiotrNawrot - 2
[Bug]: Torch not found: can't install with pip install (Python 3.12, CUDA 12.6 Update 1, PyTorch 2.4.1)
#77 opened by atemerev - 1
[Question]: sparsity of minference
#78 opened by susu1210 - 2
[Bug]: loc("Minference/minference/ops/pit_sparse_flash_attention_v2.py":110:23): error: operation scheduled before its operands
#75 opened by leoyuppieqnew - 1
[Question]: Could you provide more examples of other attention usages, e.g., dilated1, streaming, snapkv?
#76 opened by gaow0007 - 1
[Feature Request]: Support LLaVA Model feature request / Low generation speed
#74 opened by ThisisBillhe - 10
[Bug]: vllm executor.driver_worker. 'RayWorkerWrapper' object has no attribute 'model_runner'
#67 opened by TPLink32 - 1
[Question]: What is the speedup of the attention kernel in the current implementation?
#73 opened by foreverpiano - 1
Performance Degradation when Using MInference with Qwen2-7B-Instruct Model
#71 opened by yumingfan-0219 - 2
[Question]: Memory measurement
#68 opened by HayeonLee - 3
[Question]: OOM occurs when reproducing end-to-end latency tests with a single A100
#65 opened by HayeonLee - 2
[Question]: It seems that MInference does not currently support tensor parallelism under vLLM, right? In a multi-GPU environment, the head_id here is incorrect compared to a single GPU
#62 opened by zh2333 - 2
[Question]: AssertionError: The model /workspace/model/llm/Qwen/Qwen2-7B-Instruct you specified is not supported.
#60 opened by LIUKAI0815 - 4
Does MInference support CUDA 11.8?
#56 opened by hensiesp32 - 1
[Bug]: When doing distributed inference with vLLM and MInference, the example fails with an error when tensor_parallel_size is set to a value greater than 1
#63 opened by zh2333 - 3
[Question]: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
#58 opened by IvanDeng0 - 2
[Question]: Errors when I reproduce results in Table 5 (MInference + SnapKV) & poor results with attn_type=minference_with_dense
#55 opened by HayeonLee - 2
[ToDo]: V0.1.6 Iteration Plan
#50 opened by iofu728 - 0
[Question]: Question about the settings of vertical_size and slash_size in vertical_and_slash pattern
#47 opened by ALUKErnel - 1
[Question]: run with local models
#41 opened by qiling1345 - 1
[Question]: Does vertical_slash_sparse_attention support concatenating all batches into a single row for an operation like flash_attn_2_cuda.varlen_fwd?
#46 opened by Amanda-Barbara - 2
[Bug]: NameError: name 'cache_ops' is not defined
#42 opened by Zoro528 - 1
[Question]: Why is running MInference/examples/run_vllm.py not as fast as running vllm alone?
#43 opened by zjjznw123