Fake load balancing?
Closed this issue · 15 comments
System Info
CUDA 12.4, 4090D
Running Xinference with Docker?
- docker
- pip install
- installation from source
Version info
0.16.0
The command used to start Xinference
Distributed deployment
Reproduction
Fake load balancing? The model can only be deployed on a single node, not on multiple nodes, so the other nodes cannot share the load. Although I have two worker nodes, I can only deploy the model on one of them; I cannot deploy the same model with the same uid on another node and expose a unified service.
Expected behavior
True load balancing.
Now you need to specify replica=2 to share the same uid.
Even when I do that, they are still on the same node. I want it to balance to another node.
I want to use the same uid on two different worker nodes.
Currently, worker_ip and GPU indexes are not well designed for multiple replicas; internally there is actually no such limitation.
So Xinference can't meet my requirements, and there is no load balancing.
Any plan or roadmap for this?
@ck7colin To implement true load balancing, you'd probably need a gateway-like component, and there doesn't seem to be anything like that among the components Xinference starts.
The current situation is:
- When replica > 1, the supervisor assigns the replicas automatically. In that case, requests are load balanced.
- When replica == 1, you can pin placement precisely by specifying worker_ip and the GPU index.
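The replica > 1 behavior described above amounts to rotating new requests across the replicas. A minimal illustrative sketch of such round-robin dispatch (hypothetical code, not Xinference's actual scheduler; the addresses are made up):

```python
from itertools import cycle

class RoundRobinDispatcher:
    """Hand each new request to the next replica in turn."""

    def __init__(self, replica_addresses):
        # cycle() repeats the address list endlessly.
        self._ring = cycle(replica_addresses)

    def pick(self):
        # Each call returns the next replica address in round-robin order.
        return next(self._ring)

dispatcher = RoundRobinDispatcher(["worker-a:9997", "worker-b:9997"])
picks = [dispatcher.pick() for _ in range(4)]
print(picks)  # alternates between the two workers
```

Note that this kind of rotation spreads request counts evenly but ignores how busy each replica actually is, which is relevant to the follow-up question below.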
P.S. I answer mainly in whichever language the asker mostly uses, especially in the title; not everyone who files issues is a ** person. I will only explain this once.
@qinxuye Can the current load balancing be configured to balance by the number of in-flight requests, so that the replica with the fewest running requests handles new ones? Right now, on a single machine with multiple GPUs and multiple replicas, a replica that is responding slowly to some long-context, long-output requests often keeps getting assigned more and more requests.
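The behavior asked for here is "least outstanding requests" routing: track in-flight requests per replica and send new work to the least-busy one. An illustrative sketch (made-up names, not Xinference internals):

```python
class LeastBusyDispatcher:
    """Route each new request to the replica with the fewest in-flight requests."""

    def __init__(self, replicas):
        self._inflight = {r: 0 for r in replicas}

    def acquire(self):
        # Pick the replica currently serving the fewest requests; ties go to
        # the earliest-registered replica (dict insertion order).
        replica = min(self._inflight, key=self._inflight.get)
        self._inflight[replica] += 1
        return replica

    def release(self, replica):
        # Call when a request finishes so the replica's count drops.
        self._inflight[replica] -= 1

d = LeastBusyDispatcher(["gpu0", "gpu1"])
a = d.acquire()   # both idle -> gpu0
b = d.acquire()   # gpu0 busy -> gpu1
d.release(a)      # gpu0 finishes a short request
c = d.acquire()   # gpu0 is least busy again
```

Unlike round-robin, this avoids piling requests onto a replica that is stuck on slow, long-context generations, at the cost of having to track request completion.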
This likely won't be supported in the open-source version in the short term; it falls under work-stealing territory.
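For context, "work stealing" means an idle replica pulls queued work from a busy peer instead of a central dispatcher balancing up front. A minimal sketch of the idea (hypothetical, not Xinference code):

```python
from collections import deque

# Each replica owns a double-ended queue of pending requests.
queues = {
    "replica-0": deque(["req1", "req2", "req3"]),
    "replica-1": deque(),
}

def next_task(me):
    """Serve own work first; if idle, steal from the tail of a busy peer."""
    if queues[me]:
        return queues[me].popleft()   # own queue: take from the front
    for other, q in queues.items():
        if other != me and q:
            return q.pop()            # steal from the back of a peer's queue
    return None

stolen = next_task("replica-1")  # idle replica steals from replica-0's tail
own = next_task("replica-0")     # busy replica keeps serving its own front
```

Stealing from the tail while the owner pops from the head is the classic scheme that keeps owner and thief mostly out of each other's way.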
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 5 days since being marked as stale.