kubeagi/arcadia

vllm+KubeRay deployment, streaming is very slow

AIApprentice101 opened this issue · 5 comments

Thank you for the great repo. I followed the instructions to deploy the Mistral-7B-AWQ model using vLLM and KubeRay on a GCP k8s cluster. What I find is that for the exact same request (temperature=0 for reproducibility), streaming takes much longer than a regular (non-streaming) request, especially when the decoding is lengthy.

I can't reproduce this in my local deployment, so I suspect it's an issue with the Ray cluster? Any help would be much appreciated. Thank you.

Hi @AIApprentice101. Can you provide more details, such as:

  • how you created your Ray cluster
  • how you use vLLM with Ray

@AIApprentice101 See http://kubeagi.k8s.com.cn/docs/Performance/distributed-inference. I'm not sure if you're using distributed inference with multiple GPUs across nodes; if so, the performance might be poor.

@bjwswang @nkwangleiGIT Thank you for your prompt response. I'm doing very basic stuff, using a single L4 GPU to serve the model. Here are the vLLM configs I'm using.

llm_model_name = "Mistral-7B-Instruct-v0.2-AWQ"
tensor_parallel_size = "1"
gpu_memory_utilization = "0.9"
quantization = "awq"
worker_use_ray = "false"
max_model_len = "19456"
enable_prefix_caching = "true"
max_num_seqs = "64"
enforce_eager = "false"
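
For context, these options end up as vLLM engine arguments inside the Ray Serve deployment. Roughly, the wiring looks like the sketch below (simplified and illustrative; the VLLMDeployment class name and the generate handler are placeholders, not my actual deployment file, and the OpenAI-compatible routes from the tutorial are omitted):

```python
# Sketch: how the configs above map onto vLLM's AsyncEngineArgs inside a
# Ray Serve deployment (illustrative only).
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import random_uuid


@serve.deployment(ray_actor_options={"num_gpus": 1})
class VLLMDeployment:
    def __init__(self):
        engine_args = AsyncEngineArgs(
            model="Mistral-7B-Instruct-v0.2-AWQ",
            tensor_parallel_size=1,
            gpu_memory_utilization=0.9,
            quantization="awq",
            worker_use_ray=False,
            max_model_len=19456,
            enable_prefix_caching=True,
            max_num_seqs=64,
            enforce_eager=False,
        )
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)

    async def generate(self, prompt: str) -> str:
        # Non-streaming path: drain the async generator, return the final text.
        params = SamplingParams(temperature=0.0, max_tokens=512)
        final_output = None
        async for output in self.engine.generate(prompt, params, random_uuid()):
            final_output = output
        return final_output.outputs[0].text


vllm_app = VLLMDeployment.bind()
```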

I closely followed the instructions in this link: http://kubeagi.k8s.com.cn/docs/Configuration/DistributedInference/deploy-using-rary-serve. The only modification I made was to use the vllm/vllm-openai:v0.4.2 image rather than 0.4.1, since the patch you applied has been fixed upstream in v0.4.2 (vllm-project/vllm#2683).

As for the Ray cluster environment, it was set up by our SREs on GCP, so I don't have much detail there. One question though: in your setup, you specify a GPU for both the Ray head and worker nodes. Is there any particular reason we need a GPU on the head node?

What I observe is that the non-streaming response (stream=False) on the Ray cluster performs very close to my local deployment using serve run, but streaming is abnormally slow on the Ray cluster. This is most obvious for long-decoding tasks (e.g. "Write me a long essay with at least 20 paragraphs"). A simple timing script like the one below reproduces the gap.
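
Here is roughly how I'm measuring it against the OpenAI-compatible endpoint exposed by Serve (the base URL and model name below are placeholders for my actual values):

```python
# Compare non-streaming latency vs. streaming time-to-first-token / total time
# against the OpenAI-compatible endpoint (base_url and model are placeholders).
import time

from openai import OpenAI

client = OpenAI(base_url="http://<ray-serve-host>:8000/v1", api_key="EMPTY")
MODEL = "Mistral-7B-Instruct-v0.2-AWQ"
MESSAGES = [{"role": "user",
             "content": "Write me a long essay with at least 20 paragraphs."}]

# Non-streaming: one request, one response.
t0 = time.time()
resp = client.chat.completions.create(model=MODEL, messages=MESSAGES, temperature=0)
print(f"non-streaming: {time.time() - t0:.2f}s, "
      f"{len(resp.choices[0].message.content)} chars")

# Streaming: measure time to first chunk and total time.
t0 = time.time()
first_token_at = None
chars = 0
stream = client.chat.completions.create(
    model=MODEL, messages=MESSAGES, temperature=0, stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    if delta and first_token_at is None:
        first_token_at = time.time() - t0
    chars += len(delta)
print(f"streaming: first token at {first_token_at:.2f}s, "
      f"total {time.time() - t0:.2f}s, {chars} chars")
```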

Any directions or suggestions would be appreciated. Thank you.

Is there any particular reason we need a GPU on the head node?

No particular reason; it just simplifies the test environment so we can try out how Ray works for serving and autoscaling.

I'm not sure the performance issue is caused by vLLM; I didn't notice it when I tested.
You could try inference without vLLM and see how streaming performs, to isolate whether the slowdown comes from the Ray Serve/cluster side; see the sketch below.
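
For example, a minimal Serve app that streams fake tokens (no model at all) would tell you whether chunked HTTP streaming itself is slow through your cluster's ingress or load balancer. This is just a sketch under the assumption of a Ray version that supports HTTP response streaming; the FakeStreamer name and timings are made up:

```python
# Minimal Ray Serve streaming deployment with no vLLM involved, to check
# whether chunk-by-chunk streaming itself is slow on the cluster.
import asyncio
import time

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from ray import serve

app = FastAPI()


@serve.deployment
@serve.ingress(app)
class FakeStreamer:
    @app.get("/stream")
    async def stream(self, n: int = 500):
        async def gen():
            start = time.time()
            for i in range(n):
                await asyncio.sleep(0.01)  # pretend each "token" takes ~10 ms
                yield f"tok{i} "
            yield f"\n[done in {time.time() - start:.2f}s]\n"

        return StreamingResponse(gen(), media_type="text/plain")


fake_app = FakeStreamer.bind()
# serve run <module>:fake_app, then: curl -N http://<host>:8000/stream
```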

It turned out to be an issue with the setup on our end. Thank you for your help.