microsoft/MInference
To speed up long-context LLM inference, MInference computes attention with approximate, dynamic sparse patterns, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
Python · MIT License
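The "vertical and slash" sparse pattern referenced throughout the issues below can be illustrated with a toy mask builder. This is a minimal NumPy sketch of the pattern family, not the library's kernel: the function name, parameters, and index choices are all illustrative assumptions, and a real implementation selects the vertical columns and slash diagonals dynamically from estimated attention scores rather than taking them as fixed inputs.

```python
import numpy as np

def vertical_slash_mask(n, vertical_idx, slash_offsets):
    """Toy sketch: keep only selected vertical columns (key positions
    every query attends to) and slash lines (keys at a fixed offset
    behind each query), intersected with the causal mask."""
    mask = np.zeros((n, n), dtype=bool)
    # Vertical lines: every query attends to these key positions.
    mask[:, vertical_idx] = True
    # Slash lines: diagonals at a fixed offset behind each query.
    rows = np.arange(n)
    for off in slash_offsets:
        cols = rows - off
        valid = cols >= 0
        mask[rows[valid], cols[valid]] = True
    # Enforce causality: a query never attends to future keys.
    return mask & np.tril(np.ones((n, n), dtype=bool))

m = vertical_slash_mask(8, vertical_idx=[0, 1], slash_offsets=[0, 1])
# m keeps 26 of the 36 causal positions here; at long context
# lengths the kept fraction becomes far smaller, which is where
# the pre-filling speedup comes from.
```

For long sequences the number of kept entries grows roughly linearly in sequence length (a few columns plus a few diagonals) instead of quadratically, which is the intuition behind the latency reduction claimed above.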
Issues
- [Bug]: vllm executor.driver_worker: 'RayWorkerWrapper' object has no attribute 'model_runner' (#67, opened by TPLink32, 5 comments)
- [Question]: How can I generate topk_dims_file_path with a self-trained model when searching for patterns? (#36, opened by Amanda-Barbara, 1 comment)
- Performance Degradation when Using MInference with Qwen2-7B-Instruct Model (#71, opened by yumingfan-0219, 2 comments)
- [Question]: Memory measurement (#68, opened by HayeonLee, 3 comments)
- [Question]: OOM occurs when reproducing end-to-end latency tests with a single A100 (#65, opened by HayeonLee, 2 comments)
- [Question]: It seems that MInference does not currently support tensor parallelism under vLLM; in a multi-GPU environment the head_id is incorrect compared to a single GPU (#62, opened by zh2333, 2 comments)
- [Question]: AssertionError: The model /workspace/model/llm/Qwen/Qwen2-7B-Instruct you specified is not supported. (#60, opened by LIUKAI0815, 10 comments)
- [Feature Request]: Support Mistral Model (#39, opened by PatchouliTIS, 4 comments)
- Does MInference support CUDA 11.8? (#56, opened by hensiesp32, 1 comment)
- [Bug]: When doing distributed inference with vLLM and MInference, the example fails with an error if tensor_parallel_size is set to a value greater than 1 (#63, opened by zh2333, 3 comments)
- [Question]: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method (#58, opened by IvanDeng0, 2 comments)
- [Question]: Errors when reproducing results in Table 5 (MInference + SnapKV) and poor results with attn_type=minference_with_dense (#55, opened by HayeonLee, 1 comment)
- [ToDo]: V0.1.5 Iteration Plan (#27, opened by iofu728, 0 comments)
- [ToDo]: V0.1.6 Iteration Plan (#50, opened by iofu728, 1 comment)
- [Question]: Question about the settings of vertical_size and slash_size in the vertical_and_slash pattern (#47, opened by ALUKErnel, 6 comments)
- [Question]: run with local models (#41, opened by qiling1345, 2 comments)
- [Question]: I got the error "No CUDA GPUs are available"; how can I run on CPU? (#38, opened by qiling1345, 1 comment)
- [Question]: Does vertical_slash_sparse_attention support concatenating all batches into a single row, as in flash_attn_2_cuda.varlen_fwd? (#46, opened by Amanda-Barbara, 1 comment)
- [Bug]: NameError: name 'cache_ops' is not defined (#42, opened by Zoro528, 1 comment)
- [Question]: Why is running MInference/examples/run_vllm.py not as fast as running vLLM alone? (#43, opened by zjjznw123, 5 comments)
- [Question]: "Building wheel for flash_attn (setup.py)" runs for a long time without any output (#37, opened by qiling1345, 2 comments)
- [Question]: Is the A6000 supported? (#23, opened by yawzhe, 1 comment)
- [Question]: RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1) (#25, opened by LiweiPE, 2 comments)
- [Bug]: missing warnings import in `setup.py` (#28, opened by devangvin, 1 comment)
- [Question]: Question about KV-cache storage (#20, opened by DerrickYLJ, 1 comment)
- [Feature Request]: Could you please support microsoft/Phi-3-medium-128k-instruct? Thank you! (#26, opened by cckao, 6 comments)
- [Question]: python run_vllm.py TypeError: 'type' object is not subscriptable (#13, opened by junior-zsy, 1 comment)
- [Question]: vertical slash pattern (#21, opened by SimJeg, 4 comments)
- [Question]: Some questions on the code (#17, opened by cyLi-Tiger, 5 comments)
- [Question]: For tests such as RULER and InfiniteBench mentioned in the paper, what datasets are used to search for patterns? (#16, opened by hijkzzz, 1 comment)
- [Question]: pip install minference error: cannot import name 'packaging' from 'pkg_resources' (#12, opened by junior-zsy, 2 comments)
- [Question]: Please document how HBM usage grows with context length (#7, opened by Arcmoon-Hu, 3 comments)
- Action required: migrate or opt-out of migration to GitHub inside Microsoft (#1, opened by microsoft-github-policy-service)