vllm-project/production-stack

[Roadmap] vLLM production stack roadmap for 2025 Q1

ApostaC opened this issue · 22 comments

This project covers a set of production-related modules around vLLM, including the router, autoscaling, observability, KV cache offloading, and framework support (KServe, Ray, etc.).

This document lists the items on our Q1 roadmap. We will keep updating it with the related issues, pull requests, and discussions in the #production-stack channel of the vLLM Slack.

Core features

  • (P0) Prefix-cache-aware routing algorithm (#19; a sketch of the idea follows this list)
  • (P1) Offline batched inference based on OpenAI offline batching API
    • Part 1: file storage support (#47, #52)
    • Part 2: batched inference API support
  • (P1) Router observability (Current QPS, router-side queueing delay, number of pending / prefilling / decoding requests, average prefill / decoding length, etc.) (#78, #119)
  • (P1) Autoscaling support
  • (P2) Experimental support for disaggregated prefill
  • (P2) Support vLLM v1
  • (P2) Rewrite the router in a more performant language (e.g., Rust or Go) for higher QPS/throughput and lower latency
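To make the prefix-cache-aware routing item concrete, here is a minimal sketch of the core idea, not the design being discussed in #19: hash a fixed-length prefix of the prompt and pin it to one backend, so requests sharing a prefix hit the same vLLM instance's KV cache. The class name, endpoints, and prefix length are all hypothetical.

```python
import hashlib

class PrefixAwareRouter:
    """Route requests that share a prompt prefix to the same vLLM
    backend, so that backend's prefix (KV) cache can be reused."""

    def __init__(self, endpoints, prefix_len=256):
        self.endpoints = endpoints    # list of backend URLs
        self.prefix_len = prefix_len  # leading characters used as the routing key

    def route(self, prompt: str) -> str:
        # Requests with the same leading prefix_len characters hash to
        # the same backend deterministically.
        key = hashlib.sha256(prompt[: self.prefix_len].encode()).digest()
        idx = int.from_bytes(key[:8], "big") % len(self.endpoints)
        return self.endpoints[idx]

router = PrefixAwareRouter(["http://vllm-0:8000", "http://vllm-1:8000"])
print(router.route("You are a helpful assistant. Summarize the following text: ..."))
```

A real implementation would also have to handle endpoint churn (e.g., via consistent hashing) and guard against load imbalance when one prefix dominates the traffic.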

CI/CD and packaging

  • (P0) Add unit test to the repo (#24)
  • (P0) Add end-to-end test of the deployment (#30)
  • (P0) Automatically release the Helm charts and the router Docker images (#23, #74)
  • (P1) Package the router as a separate Python package, vllm-router (#17)

OSS-related support

  • (P0) Format checker for the code (#35)
  • (P2) Issue and PR templates and labels (#93)

If an item you want is not on the roadmap, your suggestions and contributions are very welcome! Please feel free to comment in this thread, open a feature request, or create an RFC.

Happy vLLMing!

Regarding this refactor on the roadmap:

(P2) Rewrite the router in a more performant language (e.g., Rust or Go) for higher QPS/throughput and lower latency

It would be best to tackle this as soon as possible. As the router grows, it’s only going to get trickier. For the language, I’d go with Go since it meshes well with Kubernetes and monitoring. Go’s k8s client library is way more mature, while Rust's support for Kubernetes isn’t great.

(P0) Format checker for the code

I can help bring over the new formatting setup from vLLM if you'd like. It's much simpler than format.sh, and (as long as it gets installed) you can't forget to run it!

It would be best to tackle this as soon as possible. As the router grows, it’s only going to get trickier. For the language, I’d go with Go since it meshes well with Kubernetes and monitoring. Go’s k8s client library is way more mature, while Rust's support for Kubernetes isn’t great.

@gaocegege I see your point. One thing that makes me a bit hesitant is that Python is friendlier to the LLM community. The current plan is to first build a performance benchmark for the router and see how "bad" the current Python version is.

Another backup solution in my mind is to have a Go backbone for the data plane but keep a Python interface for the routing logic, so that the community (both industry and academia) can contribute to it.
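As an illustration of that backup plan, here is a minimal sketch of what the Python-side routing interface could look like; the class and method names are hypothetical, and the Go data plane would invoke implementations of this (e.g., over gRPC or an embedded interpreter):

```python
from abc import ABC, abstractmethod

class RoutingLogic(ABC):
    """Hypothetical plugin interface the Go data plane would call
    to pick a backend for each incoming request."""

    @abstractmethod
    def select_endpoint(self, request_meta: dict, endpoints: list[str]) -> str:
        """Return the URL of the backend that should serve this request."""

class RoundRobin(RoutingLogic):
    """Trivial example implementation that contributors could swap out."""

    def __init__(self):
        self._next = 0

    def select_endpoint(self, request_meta, endpoints):
        endpoint = endpoints[self._next % len(endpoints)]
        self._next += 1
        return endpoint
```

The appeal of this split is that the hot path (connection handling, streaming) stays in Go while the part researchers actually want to experiment with stays in Python.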

I can help bring over the new formatting setup from vLLM if you'd like. It's much simpler than format.sh, and (as long as it gets installed) you can't forget to run it!

@hmellor Thanks for chiming in! Would love to see your contribution! Feel free to create a new issue or PR for this!

It would be best to tackle this as soon as possible. As the router grows, it’s only going to get trickier. For the language, I’d go with Go since it meshes well with Kubernetes and monitoring. Go’s k8s client library is way more mature, while Rust's support for Kubernetes isn’t great.

@gaocegege I see your point. One thing that makes me a bit hesitant is that Python is friendlier to the LLM community. The current plan is to first build a performance benchmark for the router and see how "bad" the current Python version is.

Another backup solution in my mind is to have a Go backbone for the data plane but keep a Python interface for the routing logic, so that the community (both industry and academia) can contribute to it.

SGTM.

@ApostaC great! Please take a look at #35

I'm curious whether you see the production stack evolving towards a Kubernetes operator. I see someone already suggested CRDs here: #7

As complexity grows, I could see this stack becoming more Kubernetes-native to simplify operations even further.

I'm curious whether you see the production stack evolving towards a Kubernetes operator.

Thanks for bringing up the question, @spron-in. Right now, the stack is relatively simple, so we don't have immediate plans for this. Also, we hope the components in this stack can be directly reused for different purposes.

But we will definitely consider a more end-to-end solution to simplify operations as the stack grows more complex.

  • Looks like we can't change the resources (CPU, memory) of the vllm-stack-deployment-router deployment
  • Examples of bringing your own model would be helpful (PyTorch preferred)
  • Shall we have an image tag with a specific version instead of latest for every Helm release?

Hey @nithin8702, thanks for your interest. To your questions:


  • Looks like we can't change the resources (CPU, memory) of the vllm-stack-deployment-router deployment

This should be addressed by #38.


  • Examples of bringing your own model would be helpful (PyTorch preferred)

Do you mean loading models from local storage? Please refer to this tutorial.


  • Shall we have an image tag with a specific version instead of latest for every Helm release?

Can you elaborate a bit more on this question? I'm not sure I understand what you're asking.

Also interested in seeing least-request load balancing!

I'm currently looking at this: (P1) Router observability (Current QPS, router-side queueing delay, number of pending / prefilling / decoding requests, average prefill / decoding length, etc.).

Do we have a list of default observability metrics we want yet (e.g., the list in the requirement), or should we do some research, pick a subset, and use those as the defaults for now?

I'm planning to contribute to this one.

Also interested in seeing least-request load balancing!

@AlexXi19 Hey Alex, we are discussing this in #59, and Kuntai (@KuntaiDu) is currently designing and implementing the functionality in the router.
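For anyone unfamiliar with the term, least-request load balancing simply sends each new request to the backend with the fewest in-flight requests. A minimal sketch of the technique (hypothetical names, not the design being worked out in #59):

```python
class LeastRequestBalancer:
    """Track in-flight requests per backend and always pick the least loaded."""

    def __init__(self, endpoints):
        self.in_flight = {ep: 0 for ep in endpoints}

    def acquire(self) -> str:
        # Pick the endpoint with the fewest in-flight requests.
        endpoint = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[endpoint] += 1
        return endpoint

    def release(self, endpoint: str) -> None:
        # Call when the request completes or fails.
        self.in_flight[endpoint] -= 1
```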

I'm currently looking at this: (P1) Router observability (Current QPS, router-side queueing delay, number of pending / prefilling / decoding requests, average prefill / decoding length, etc.).

Do we have a list of default observability metrics we want yet (e.g., the list in the requirement), or should we do some research, pick a subset, and use those as the defaults for now?

I'm planning to contribute to this one.

Thanks @sitloboi2012! Currently the router maintains a list of stats internally. We can start by exposing an interface and exporting those metrics to Prometheus.

Would you like to open an issue for further discussion?
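As a rough sketch of what that export could look like with the prometheus_client library (the metric names below are illustrative, not the router's actual stats interface):

```python
from prometheus_client import Gauge, Histogram, start_http_server

# Illustrative metric names; the router's real internal stats would be wired in here.
qps = Gauge("router_current_qps", "Requests per second observed by the router")
pending = Gauge("router_pending_requests", "Requests waiting to be scheduled")
queue_delay = Histogram("router_queueing_delay_seconds",
                        "Router-side queueing delay per request")

def report(stats: dict) -> None:
    """Push one snapshot of the router's internal stats into the metrics."""
    qps.set(stats["qps"])
    pending.set(stats["num_pending"])
    queue_delay.observe(stats["last_queue_delay"])

start_http_server(9000)  # Prometheus scrapes http://<router>:9000/metrics
```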

I'm currently looking at this: (P1) Router observability (Current QPS, router-side queueing delay, number of pending / prefilling / decoding requests, average prefill / decoding length, etc.).
Do we have a list of default observability metrics we want yet (e.g., the list in the requirement), or should we do some research, pick a subset, and use those as the defaults for now?
I'm planning to contribute to this one.

Thanks @sitloboi2012! Currently the router maintains a list of stats internally. We can start by exposing an interface and exporting those metrics to Prometheus.

Would you like to open an issue for further discussion?

Yep, let's move this to this issue: Feat: Router Observability. @ApostaC

Hi Team

Shall we have release notes for every release?

Hi Team

Shall we have release notes for every release?

Good question! We will do that soon. @Shaoting-Feng @YuhanLiu11, please take note.

Does this support multi-tenancy or namespace isolation?

Does this support multi-tenancy or namespace isolation?

You can specify the namespace when running helm install.
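For example (the release and chart names here are illustrative): helm install vllm vllm/vllm-stack --namespace team-a --create-namespace. Helm's --namespace and --create-namespace flags scope all of the stack's Kubernetes resources to that namespace, which gives you per-team isolation.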