Does MInference support CUDA 11.8?
hensiesp32 opened this issue · 4 comments
Describe the issue
I am wondering whether MInference supports CUDA 11.8. Our devices don't support CUDA 12.3.
Hi @hensiesp32, thanks for your interest in MInference.
It supports CUDA 11.8. We have released the wheel for CUDA 11.8 at this link. If you have any questions, feel free to leave a comment here.
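To double-check which wheel variant matches a machine before installing, a small helper like the sketch below can map the local CUDA toolkit version to a build tag. The `cu118`/`cu123` tag names are an assumption based on the common PyTorch wheel convention; check the actual release page for the real file names.

```python
# Sketch: map a local CUDA version string to an assumed wheel build tag.
# The "cu118"/"cu123" names follow the usual PyTorch wheel convention and
# are assumptions here; verify against the actual MInference release assets.

def wheel_tag(cuda_version: str) -> str:
    """Return the assumed wheel tag for a CUDA version like "11.8"."""
    major, minor = (int(x) for x in cuda_version.split(".")[:2])
    if (major, minor) >= (12, 3):
        return "cu123"
    if (major, minor) >= (11, 8):
        return "cu118"
    raise RuntimeError(
        f"CUDA {cuda_version} is older than the oldest supported build (11.8)"
    )

print(wheel_tag("11.8"))  # cu118
```

On a CUDA 11.8 or 12.x machine this picks the newest compatible tag; anything older than 11.8 fails fast instead of installing a wheel whose kernels won't load.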
Thanks for your reply. I want to run the needle-in-a-haystack experiment on a single A100-80G; however, when the context length reached 300K, I got an OOM error. I then enabled `kv_cache_cpu`, but hit this error:
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions
So I want to know: how do you test needle-in-a-haystack with a 1M context length? Or can we use multiple GPUs to run it?
Hi @hensiesp32,
- For the benchmark test, the results don't seem to make sense, especially with StreamingLLM. Did you use vLLM for the measurements? Our experiments were conducted on a single A100 using HF or vLLM (details at https://github.com/microsoft/MInference/tree/main/experiments#minference-benchmark-experiments), and I've received feedback that the corresponding kernel isn't replaced in multi-GPU setups. Could you test it on a single A100 for now? We will support multi-GPU mode in the future.
- When testing Needle In A Haystack, I used `kv_cache_cpu` for contexts over 200K. However, this requires enough CPU memory on your machine: around 300 GB for 1M tokens.