kubedl-io/kubedl

[BUG] the DAGScheduling and GangScheduling(volcano) conflict in mpijob

HeGaoYuan opened this issue · 15 comments

What happened:
The MPIJob worker pods are stuck in Pending, and there is no launcher pod.

mpi-demo-worker-0             0/1     Pending     0          13s
mpi-demo-worker-1             0/1     Pending     0          13s

The events of a worker pod are as follows:

Events:
  Type     Reason            Age   From     Message
  ----     ------            ----  ----     -------
  Warning  FailedScheduling  65s   volcano  3/2 tasks in gang unschedulable: pod group is not ready, 2 Pending, 3 minAvailable.

I think the root cause is that DAGScheduling and GangScheduling (volcano) conflict in MPIJob: the volcano PodGroup's minAvailable (3) counts the launcher, but DAG scheduling holds the launcher back until the workers are running, so only the 2 worker pods exist, the gang is never satisfied, and the workers stay Pending.

I can work around this problem by adding these args to the kubedl deployment:

- --feature-gates
- DAGScheduling=false
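
For reference, a sketch of where these args sit in the controller Deployment; the deployment name, namespace, and container name below are assumptions and may differ in your install.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubedl-controller-manager    # assumed name
  namespace: kubedl-system           # assumed namespace
spec:
  template:
    spec:
      containers:
        - name: manager              # assumed container name
          args:
            - --feature-gates
            - DAGScheduling=false    # disables DAG scheduling as a workaround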

What you expected to happen:

No pods stuck in Pending; the launcher pod should be created and the job should run.

How to reproduce it:
Enable both DAGScheduling and GangScheduling (volcano), then run an MPIJob.
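
For reference, a minimal sketch of such an MPIJob; the API group/version is assumed to be KubeDL's training.kubedl.io/v1alpha1, the image is a placeholder, and gang scheduling is assumed to be enabled on the controller side for volcano.

apiVersion: training.kubedl.io/v1alpha1   # assumed API group/version
kind: MPIJob
metadata:
  name: mpi-demo
spec:
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: mpi
              image: registry.example.com/mpi-demo:latest   # placeholder image
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: mpi
              image: registry.example.com/mpi-demo:latest   # placeholder image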

Anything else we need to know?:

Environment:

  • KubeDL version:
  • Kubernetes version (use kubectl version):
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

@HeGaoYuan Thanks GaoYuan. In the current implementation, DAGScheduling does conflict with GangScheduling; you can disable DAGScheduling temporarily by setting the startup flag: --feature-gates=DAGScheduling=false

Gang scheduling should be split up when DAG is enabled. For example, for a TFJob with 1 PS and 10 workers there should be 2 PodGroups, grouped by replica type. Please assign an issue to me and I'll implement it later.
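
A rough sketch of what that split could look like for the TFJob example, using the volcano PodGroup API; the names and grouping below are illustrative, not the actual implementation.

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: tf-demo-ps          # assumed naming scheme
spec:
  minMember: 1              # the single PS pod
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: tf-demo-worker      # assumed naming scheme
spec:
  minMember: 10             # all 10 worker pods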

In my understanding, the reason we use GangScheduling is to avoid deadlock when multiple job instances request resources. Splitting into 2 PodGroups may still lead to the deadlock problem. Maybe we should think of a better way to solve this?

In this case, the launcher pod is only a single pod and requires few resources. To keep it simple, we can make all the workers a gang and exclude the launcher pod.

This won't solve every case, but it is simple and worth trying if it works for most cases in practice. What do you think?
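
A sketch of that idea for the mpi-demo job above, assuming volcano's PodGroup and the scheduling.k8s.io/group-name pod annotation; the launcher pod simply carries no group annotation, so it is scheduled outside the gang.

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: mpi-demo-worker     # assumed naming scheme
spec:
  minMember: 2              # only the two worker replicas

Worker pod template fragment (assumed) that joins the gang:

metadata:
  annotations:
    scheduling.k8s.io/group-name: mpi-demo-worker
spec:
  schedulerName: volcano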

In my opinion, because we have the launcherRunsWorkload ability, the launcher pod may also request a lot of resources.

OK. In that case, we could disable DAG scheduling when launcherRunsWorkload is enabled. @HeGaoYuan @SimonCqk opinions?

In my opinion, we may need to refactor the DAG scheduling. Similar to the implementation of mpijob's initContainers, we could implement DAG scheduling by injecting an initContainer that waits for the dependency pod to be running.
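
For concreteness, a sketch of such an injected init container, waiting on a hypothetical dependency pod with kubectl wait; the image, pod name, and required RBAC are assumptions, not KubeDL's actual implementation.

initContainers:
  - name: wait-for-dependency
    image: bitnami/kubectl:latest    # assumed helper image; needs RBAC to read pods
    command:
      - kubectl
      - wait
      - --for=condition=Ready
      - pod/mpi-demo-worker-0        # hypothetical dependency pod
      - --timeout=600s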

Then two problems arise:

  • Thousands of init containers would be created, each watching or polling the apiserver, which causes heavy overhead on the control plane.
  • Downstream pods may be scheduled too early (not only does upstream Running trigger downstream, upstream Succeeded does too), consuming resources while they keep stalling.

Is launcherRunsWorkload a valid use case? Why is it needed? @HeGaoYuan

launcherRunsWorkload was not introduced by me, but I do use this ability in practice. Its only advantage is that it reduces the number of pods.

Speaking of launcherRunsWorkload: it is currently a global variable of kubedl. I suggest changing it to a per-job field. What do you think? @jian-he @SimonCqk

Good consideration on those two anomalies!

Good point. Actually, the global flag launcherRunsWorkload can be removed: mpiReplicaSpecs with the Launcher role already indicates that the mpijob will be driven by the launcher pod, which is the launcherRunsWorkload semantics.
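
In other words, a sketch of the per-job semantics described above (field names assumed from the kubeflow-style MPIJob API): the presence of a Launcher entry is itself the switch, so the global flag becomes unnecessary.

spec:
  mpiReplicaSpecs:
    Launcher:          # present => the job is driven by the launcher pod
      replicas: 1      # (the launcherRunsWorkload semantics described above)
      # pod template omitted
    Worker:
      replicas: 2
      # pod template omitted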

@HeGaoYuan I posted an issue and will refactor it soon: #194

@HeGaoYuan Hi, I've posted a pull request to fix it, and a new image tagged daily will be pushed to Docker Hub. If you want to try it as soon as possible, please pull the latest commits on the master branch :)

Great! I will try it.