Fake load balancing?
Closed this issue · 15 comments
System Info
CUDA 12.4, 4090D
Running Xinference with Docker?
- docker
- pip install
- installation from source
Version info
0.16.0
The command used to start Xinference
Distributed deployment
Reproduction
Fake load balancing? The model can only be deployed on a single node, not on multiple nodes, so the other nodes cannot share the load. Although I have two worker nodes, I can only deploy the model on one of them; I cannot deploy the same model with the same uid on another node and expose a unified service.
Expected behavior
True load balancing.
Now you need to specify replica=2 to share the same uid.
Even when I do that, they are still on the same node. I want it to balance to another node.
I want to use the same uid on two different worker nodes.
Currently, worker_ip and GPU indexes are not well designed for multiple replicas; internally there is actually no such limitation.
So Xinference can't meet my requirements, and there is no load balancing.
Any plan or roadmap for this?
@ck7colin To implement true load balancing, you'd probably need a gateway-like component, and there doesn't seem to be anything like that among the components Xinference starts.
The current situation is:
- When replica > 1, the supervisor assigns the replicas automatically. In that case, requests are load balanced.
- When replica == 1, you can pin placement precisely by specifying worker_ip and the GPU index.
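The replica > 1 behavior described above amounts to rotating new requests across the replicas. A minimal illustrative sketch of such round-robin dispatch (hypothetical code, not Xinference's actual scheduler; the addresses are made up):

```python
from itertools import cycle

class RoundRobinDispatcher:
    """Hand each new request to the next replica in turn."""

    def __init__(self, replica_addresses):
        # cycle() repeats the address list endlessly.
        self._ring = cycle(replica_addresses)

    def pick(self):
        # Each call returns the next replica address in round-robin order.
        return next(self._ring)

dispatcher = RoundRobinDispatcher(["worker-a:9997", "worker-b:9997"])
picks = [dispatcher.pick() for _ in range(4)]
print(picks)  # alternates between the two workers
```

Note that this kind of rotation spreads request counts evenly but ignores how busy each replica actually is, which is relevant to the follow-up question below.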
P.S. I answer mainly in whichever language the asker mostly uses, especially in the title; not everyone who files issues is a ** person. I will only explain this once.
@qinxuye Can the current load balancing be configured to balance by the number of in-flight requests, so that the replica with the fewest running requests handles new ones? Right now, on a single machine with multiple GPUs and multiple replicas, a replica that is responding slowly to some long-context, long-output requests often keeps getting assigned more and more requests.
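The behavior asked for here is "least outstanding requests" routing: track in-flight requests per replica and send new work to the least-busy one. An illustrative sketch (made-up names, not Xinference internals):

```python
class LeastBusyDispatcher:
    """Route each new request to the replica with the fewest in-flight requests."""

    def __init__(self, replicas):
        self._inflight = {r: 0 for r in replicas}

    def acquire(self):
        # Pick the replica currently serving the fewest requests; ties go to
        # the earliest-registered replica (dict insertion order).
        replica = min(self._inflight, key=self._inflight.get)
        self._inflight[replica] += 1
        return replica

    def release(self, replica):
        # Call when a request finishes so the replica's count drops.
        self._inflight[replica] -= 1

d = LeastBusyDispatcher(["gpu0", "gpu1"])
a = d.acquire()   # both idle -> gpu0
b = d.acquire()   # gpu0 busy -> gpu1
d.release(a)      # gpu0 finishes a short request
c = d.acquire()   # gpu0 is least busy again
```

Unlike round-robin, this avoids piling requests onto a replica that is stuck on slow, long-context generations, at the cost of having to track request completion.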
This likely won't be supported in the open-source version in the short term; it falls under work-stealing territory.
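For context, "work stealing" means an idle replica pulls queued work from a busy peer instead of a central dispatcher balancing up front. A minimal sketch of the idea (hypothetical, not Xinference code):

```python
from collections import deque

# Each replica owns a double-ended queue of pending requests.
queues = {
    "replica-0": deque(["req1", "req2", "req3"]),
    "replica-1": deque(),
}

def next_task(me):
    """Serve own work first; if idle, steal from the tail of a busy peer."""
    if queues[me]:
        return queues[me].popleft()   # own queue: take from the front
    for other, q in queues.items():
        if other != me and q:
            return q.pop()            # steal from the back of a peer's queue
    return None

stolen = next_task("replica-1")  # idle replica steals from replica-0's tail
own = next_task("replica-0")     # busy replica keeps serving its own front
```

Stealing from the tail while the owner pops from the head is the classic scheme that keeps owner and thief mostly out of each other's way.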
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 5 days since being marked as stale.