microsoft/MInference

Does MInference support CUDA 11.8?

hensiesp32 opened this issue · 4 comments

Describe the issue

I am wondering whether MInference supports CUDA 11.8. Our devices don't support CUDA 12.3.

Hi @hensiesp32, thanks for your interest in MInference.

It supports CUDA 11.8. We have released the wheel for CUDA 11.8 at this link. If you have any questions, feel free to leave a comment here.
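For reference, installing from a prebuilt wheel typically looks like the following. The wheel filename below is purely illustrative; the real artifact name and version come from the release page linked above.

```shell
# Hypothetical example: install a prebuilt MInference wheel built for CUDA 11.8.
# Replace the filename with the actual artifact from the release page.
pip install minference-0.1.5+cu118-cp310-cp310-linux_x86_64.whl

# Sanity-check that the installed PyTorch build also targets CUDA 11.8:
python -c "import torch; print(torch.version.cuda)"
```

The key point is that the wheel's CUDA tag (`cu118`) must match the CUDA version of your local PyTorch build, or the compiled kernels will fail to load.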

Thanks for your reply. I want to run the needle-in-a-haystack experiment. I used a single A100-80G, but when the context length reached 300K I got an OOM error. I then enabled kv_cache_cpu, but hit this error:

```
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions
```
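As the message says, CUDA errors are reported asynchronously by default, so the stack trace usually points at the wrong call. Forcing synchronous launches makes the real failing kernel visible. A minimal sketch of doing this from Python (the variable must be set before CUDA is initialized; everything after the comment is illustrative):

```python
import os

# Must be set before the first CUDA call (i.e. before the model is loaded
# or any tensor touches the GPU), otherwise it has no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# From here on, a failing kernel raises at its actual call site,
# so load the model and run the long-context generation after this point.
print(os.environ["CUDA_LAUNCH_BLOCKING"])  # → 1
```

Equivalently, you can set it at the shell level: `CUDA_LAUNCH_BLOCKING=1 python your_script.py`.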

So I want to know: how do you test needle-in-a-haystack with a 1M context length? Or can we use multiple GPUs to run it?

I ran the experiments/benchmarks, but the results show that MInference doesn't achieve a speedup. I used 4 A100-80G GPUs; the results are shown below:
image

Hi @hensiesp32,

  1. For the benchmark test, the results don't seem to make sense, especially for StreamingLLM. Did you use vLLM for the measurements? Our experiments were conducted on a single A100 using HF or vLLM; details are in https://github.com/microsoft/MInference/tree/main/experiments#minference-benchmark-experiments. I've also received feedback that the corresponding kernel isn't replaced in multi-GPU setups. Could you test on a single A100 for now? We will support multi-GPU mode in the future.

  2. When testing Needle In A Haystack, I used kv_cache_cpu for contexts over 200K. However, this requires enough CPU memory on your machine, around 300 GB for a 1M-token context.
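The CPU-memory requirement is easy to sanity-check with back-of-the-envelope arithmetic. A sketch, assuming an 8B-class fp16 decoder; the layer and head counts below are illustrative, not taken from a specific model card:

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, dtype_bytes=2):
    """KV cache size: 2 tensors (K and V) per layer, each of
    shape [tokens, kv_heads, head_dim], at dtype_bytes per element."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

GB = 1024 ** 3
tokens = 1_000_000  # 1M-token context

# Full multi-head attention (32 layers, 32 KV heads, head_dim 128):
full_mha = kv_cache_bytes(tokens, layers=32, kv_heads=32, head_dim=128)
# Grouped-query attention with 8 KV heads, common in recent models:
gqa = kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128)

print(f"full MHA: {full_mha / GB:.0f} GiB, GQA: {gqa / GB:.0f} GiB")
# → full MHA: 488 GiB, GQA: 122 GiB
```

Depending on the model's KV-head count the cache alone lands in the hundreds of gigabytes at 1M tokens, so the ~300 GB figure quoted above is the right order of magnitude once runtime buffers and overhead are included.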