microsoft/MInference

Does MInference support CUDA 11.8?

hensiesp32 opened this issue · 4 comments

Describe the issue

I am wondering whether MInference supports CUDA 11.8. Our devices don't support CUDA 12.3.

Hi @hensiesp32, thanks for your interest in MInference.

It supports CUDA 11.8. We have released the wheel for CUDA 11.8 at this link. If you have any questions, feel free to leave a comment here.
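For reference, installing from a prebuilt wheel typically looks like the following. The wheel filename below is purely illustrative; the real artifact name and version come from the release page linked above.

```shell
# Hypothetical example: install a prebuilt MInference wheel built for CUDA 11.8.
# Replace the filename with the actual artifact from the release page.
pip install minference-0.1.5+cu118-cp310-cp310-linux_x86_64.whl

# Sanity-check that the installed PyTorch build also targets CUDA 11.8:
python -c "import torch; print(torch.version.cuda)"
```

The key point is that the wheel's CUDA tag (`cu118`) must match the CUDA version of your local PyTorch build, or the compiled kernels will fail to load.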

Thanks for your reply. I want to run the needle-in-a-haystack experiment. I used a single A100-80G, but when the context length reached 300K I got an OOM error. I then enabled kv_cache_cpu, but hit this error:

```
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions
```
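As the message says, CUDA errors are reported asynchronously by default, so the stack trace usually points at the wrong call. Forcing synchronous launches makes the real failing kernel visible. A minimal sketch of doing this from Python (the variable must be set before CUDA is initialized; everything after the comment is illustrative):

```python
import os

# Must be set before the first CUDA call (i.e. before the model is loaded
# or any tensor touches the GPU), otherwise it has no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# From here on, a failing kernel raises at its actual call site,
# so load the model and run the long-context generation after this point.
print(os.environ["CUDA_LAUNCH_BLOCKING"])  # → 1
```

Equivalently, you can set it at the shell level: `CUDA_LAUNCH_BLOCKING=1 python your_script.py`.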

So I want to know: how do you test needle-in-a-haystack with a 1M context length? Or can we use multiple GPUs to run it?

I ran the experiments/benchmarks, but the results show that MInference doesn't achieve a speedup. I used 4 A100-80G GPUs; the results are shown below:
image

Hi @hensiesp32,

  1. For the benchmark test, the results don't seem to make sense, especially for StreamingLLM. Did you use vLLM for the measurements? Our experiments were conducted on a single A100 using HF or vLLM; details are in https://github.com/microsoft/MInference/tree/main/experiments#minference-benchmark-experiments. I've also received feedback that the corresponding kernel isn't replaced in multi-GPU setups. Could you test on a single A100 for now? We will support multi-GPU mode in the future.

  2. When testing Needle In A Haystack, I used kv_cache_cpu for contexts over 200K. However, this requires enough CPU memory on your machine, around 300 GB for a 1M-token context.
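The CPU-memory requirement is easy to sanity-check with back-of-the-envelope arithmetic. A sketch, assuming an 8B-class fp16 decoder; the layer and head counts below are illustrative, not taken from a specific model card:

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, dtype_bytes=2):
    """KV cache size: 2 tensors (K and V) per layer, each of
    shape [tokens, kv_heads, head_dim], at dtype_bytes per element."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

GB = 1024 ** 3
tokens = 1_000_000  # 1M-token context

# Full multi-head attention (32 layers, 32 KV heads, head_dim 128):
full_mha = kv_cache_bytes(tokens, layers=32, kv_heads=32, head_dim=128)
# Grouped-query attention with 8 KV heads, common in recent models:
gqa = kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128)

print(f"full MHA: {full_mha / GB:.0f} GiB, GQA: {gqa / GB:.0f} GiB")
# → full MHA: 488 GiB, GQA: 122 GiB
```

Depending on the model's KV-head count the cache alone lands in the hundreds of gigabytes at 1M tokens, so the ~300 GB figure quoted above is the right order of magnitude once runtime buffers and overhead are included.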