vllm-project/production-stack

[Roadmap] vLLM production stack roadmap for 2025 Q1

ApostaC opened this issue · 22 comments

This project covers a set of production-related modules around vLLM, including the router, autoscaling, observability, KV cache offloading, and framework support (KServe, Ray, etc.).

This document lists the items on our Q1 roadmap. We will keep updating it with the related issues, pull requests, and discussions in the #production-stack channel of the vLLM Slack.

Core features

  • (P0) Prefix-cache-aware routing algorithm (#19; a sketch of the idea follows this list)
  • (P1) Offline batched inference based on OpenAI offline batching API
    • Part 1: file storage support (#47, #52)
    • Part 2: batched inference API support
  • (P1) Router observability (Current QPS, router-side queueing delay, number of pending / prefilling / decoding requests, average prefill / decoding length, etc.) (#78, #119)
  • (P1) Autoscaling support
  • (P2) Experimental support for disaggregated prefill
  • (P2) Support vLLM v1
  • (P2) Rewrite the router in a more performant language (e.g., Rust or Go) for higher QPS/throughput and lower latency
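To make the prefix-cache-aware routing item concrete, here is a minimal sketch of the core idea, not the design being discussed in #19: hash a fixed-length prefix of the prompt and pin it to one backend, so requests sharing a prefix hit the same vLLM instance's KV cache. The class name, endpoints, and prefix length are all hypothetical.

```python
import hashlib

class PrefixAwareRouter:
    """Route requests that share a prompt prefix to the same vLLM
    backend, so that backend's prefix (KV) cache can be reused."""

    def __init__(self, endpoints, prefix_len=256):
        self.endpoints = endpoints    # list of backend URLs
        self.prefix_len = prefix_len  # leading characters used as the routing key

    def route(self, prompt: str) -> str:
        # Requests with the same leading prefix_len characters hash to
        # the same backend deterministically.
        key = hashlib.sha256(prompt[: self.prefix_len].encode()).digest()
        idx = int.from_bytes(key[:8], "big") % len(self.endpoints)
        return self.endpoints[idx]

router = PrefixAwareRouter(["http://vllm-0:8000", "http://vllm-1:8000"])
print(router.route("You are a helpful assistant. Summarize the following text: ..."))
```

A real implementation would also have to handle endpoint churn (e.g., via consistent hashing) and guard against load imbalance when one prefix dominates the traffic.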

CI/CD and packaging

  • (P0) Add unit test to the repo (#24)
  • (P0) Add end-to-end test of the deployment (#30)
  • (P0) Automatically release the Helm charts and the router Docker images (#23, #74)
  • (P1) Package the router as a separate Python package, vllm-router (#17)

OSS-related support

  • (P0) Format checker for the code (#35)
  • (P2) Issue and PR templates and labels (#93)

If an item you want is not on the roadmap, your suggestions and contributions are very welcome! Please feel free to comment in this thread, open a feature request, or create an RFC.

Happy vLLMing!

Regarding this refactor on the roadmap:

(P2) Rewrite the router in a more performant language (e.g., Rust or Go) for higher QPS/throughput and lower latency

It would be best to tackle this as soon as possible. As the router grows, it’s only going to get trickier. For the language, I’d go with Go since it meshes well with Kubernetes and monitoring. Go’s k8s client library is way more mature, while Rust's support for Kubernetes isn’t great.

(P0) Format checker for the code

I can help bring over the new formatting setup from vLLM if you'd like. It's much simpler than format.sh, and (as long as it gets installed) you can't forget to run it!

It would be best to tackle this as soon as possible. As the router grows, it’s only going to get trickier. For the language, I’d go with Go since it meshes well with Kubernetes and monitoring. Go’s k8s client library is way more mature, while Rust's support for Kubernetes isn’t great.

@gaocegege I see your point. One thing that makes me a bit hesitant is that Python is friendlier to the LLM community. The current plan is to first build a performance benchmark for the router and see how "bad" the current Python version is.

Another backup solution in my mind is to have a Go backbone for the data plane but keep a Python interface for the routing logic, so that the community (both industry and academia) can contribute to it.
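As an illustration of that backup plan, here is a minimal sketch of what the Python-side routing interface could look like; the class and method names are hypothetical, and the Go data plane would invoke implementations of this (e.g., over gRPC or an embedded interpreter):

```python
from abc import ABC, abstractmethod

class RoutingLogic(ABC):
    """Hypothetical plugin interface the Go data plane would call
    to pick a backend for each incoming request."""

    @abstractmethod
    def select_endpoint(self, request_meta: dict, endpoints: list[str]) -> str:
        """Return the URL of the backend that should serve this request."""

class RoundRobin(RoutingLogic):
    """Trivial example implementation that contributors could swap out."""

    def __init__(self):
        self._next = 0

    def select_endpoint(self, request_meta, endpoints):
        endpoint = endpoints[self._next % len(endpoints)]
        self._next += 1
        return endpoint
```

The appeal of this split is that the hot path (connection handling, streaming) stays in Go while the part researchers actually want to experiment with stays in Python.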

I can help bring over the new formatting setup from vLLM if you'd like. It's much simpler than format.sh, and (as long as it gets installed) you can't forget to run it!

@hmellor Thanks for chiming in! Would love to see your contribution! Feel free to create a new issue or PR for this!

It would be best to tackle this as soon as possible. As the router grows, it’s only going to get trickier. For the language, I’d go with Go since it meshes well with Kubernetes and monitoring. Go’s k8s client library is way more mature, while Rust's support for Kubernetes isn’t great.

@gaocegege I see your point. One thing that makes me a bit hesitant is that Python is friendlier to the LLM community. The current plan is to first build a performance benchmark for the router and see how "bad" the current Python version is.

Another backup solution in my mind is to have a Go backbone for the data plane but keep a Python interface for the routing logic, so that the community (both industry and academia) can contribute to it.

SGTM.

@ApostaC great! Please take a look at #35

I'm curious whether you see the production stack evolving towards a Kubernetes operator. I see someone already suggested CRDs here: #7

As complexity grows, I could see this stack becoming more Kubernetes-native to simplify operations even further.

I'm curious whether you see the production stack evolving towards a Kubernetes operator.

Thanks for bringing up the question, @spron-in. Right now, the stack is relatively simple, so we don't have immediate plans for this. Also, we hope the components in this stack can be directly reused for different purposes.

But we will definitely consider a more end-to-end solution to simplify operations as the stack grows more complex.

  • Looks like we can't change the resources (CPU, memory) of the vllm-stack-deployment-router deployment
  • Examples of bringing your own model would be helpful (PyTorch preferred)
  • Shall we have an image tag with a specific version instead of latest for every Helm release?

Hey @nithin8702, thanks for your interest. To your questions:


  • Looks like we can't change the resources (CPU, memory) of the vllm-stack-deployment-router deployment

This should be addressed by #38.


  • Examples of bringing your own model would be helpful (PyTorch preferred)

Do you mean loading models from local storage? Please refer to this tutorial.


  • Shall we have an image tag with a specific version instead of latest for every Helm release?

Can you elaborate a bit more on this question? I'm not sure I understand what you're asking.

Also interested in seeing least-request load balancing!

I'm currently looking at this: (P1) Router observability (Current QPS, router-side queueing delay, number of pending / prefilling / decoding requests, average prefill / decoding length, etc.).

Do we have a list of default observability metrics we want yet (e.g., the list in the requirement), or should we do some research, pick a subset, and use those as the defaults for now?

I'm planning to contribute to this one.

Also interested in seeing least-request load balancing!

@AlexXi19 Hey Alex, we are discussing this in #59, and Kuntai (@KuntaiDu) is currently designing and implementing the functionality in the router.
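For anyone unfamiliar with the term, least-request load balancing simply sends each new request to the backend with the fewest in-flight requests. A minimal sketch of the technique (hypothetical names, not the design being worked out in #59):

```python
class LeastRequestBalancer:
    """Track in-flight requests per backend and always pick the least loaded."""

    def __init__(self, endpoints):
        self.in_flight = {ep: 0 for ep in endpoints}

    def acquire(self) -> str:
        # Pick the endpoint with the fewest in-flight requests.
        endpoint = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[endpoint] += 1
        return endpoint

    def release(self, endpoint: str) -> None:
        # Call when the request completes or fails.
        self.in_flight[endpoint] -= 1
```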

I'm currently looking at this: (P1) Router observability (Current QPS, router-side queueing delay, number of pending / prefilling / decoding requests, average prefill / decoding length, etc.).

Do we have a list of default observability metrics we want yet (e.g., the list in the requirement), or should we do some research, pick a subset, and use those as the defaults for now?

I'm planning to contribute to this one.

Thanks @sitloboi2012! Currently the router maintains a list of stats internally. We can start by exposing an interface and exporting those metrics to Prometheus.

Would you like to open an issue for further discussion?
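As a rough sketch of what that export could look like with the prometheus_client library (the metric names below are illustrative, not the router's actual stats interface):

```python
from prometheus_client import Gauge, Histogram, start_http_server

# Illustrative metric names; the router's real internal stats would be wired in here.
qps = Gauge("router_current_qps", "Requests per second observed by the router")
pending = Gauge("router_pending_requests", "Requests waiting to be scheduled")
queue_delay = Histogram("router_queueing_delay_seconds",
                        "Router-side queueing delay per request")

def report(stats: dict) -> None:
    """Push one snapshot of the router's internal stats into the metrics."""
    qps.set(stats["qps"])
    pending.set(stats["num_pending"])
    queue_delay.observe(stats["last_queue_delay"])

start_http_server(9000)  # Prometheus scrapes http://<router>:9000/metrics
```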

I'm currently looking at this: (P1) Router observability (Current QPS, router-side queueing delay, number of pending / prefilling / decoding requests, average prefill / decoding length, etc.).
Do we have a list of default observability metrics we want yet (e.g., the list in the requirement), or should we do some research, pick a subset, and use those as the defaults for now?
I'm planning to contribute to this one.

Thanks @sitloboi2012! Currently the router maintains a list of stats internally. We can start by exposing an interface and exporting those metrics to Prometheus.

Would you like to open an issue for further discussion?

Yep, let's move this to this issue: Feat: Router Observability. @ApostaC

Hi Team

Shall we have release notes for every release?

Hi Team

Shall we have release notes for every release?

Good question! We will do that soon. @Shaoting-Feng @YuhanLiu11, please take note.

Does this support multi-tenancy or namespace isolation?

Does this support multi-tenancy or namespace isolation?

You can specify the namespace when running helm install.
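For example (the release and chart names here are illustrative): helm install vllm vllm/vllm-stack --namespace team-a --create-namespace. Helm's --namespace and --create-namespace flags scope all of the stack's Kubernetes resources to that namespace, which gives you per-team isolation.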