kubedl-io/kubedl

[BUG] the DAGScheduling and GangScheduling(volcano) conflict in mpijob

HeGaoYuan opened this issue · 15 comments

What happened:
The MPIJob worker pods are stuck in Pending, and there is no launcher pod.

mpi-demo-worker-0             0/1     Pending     0          13s
mpi-demo-worker-1             0/1     Pending     0          13s

The events of a worker pod are as follows:

Events:
  Type     Reason            Age   From     Message
  ----     ------            ----  ----     -------
  Warning  FailedScheduling  65s   volcano  3/2 tasks in gang unschedulable: pod group is not ready, 2 Pending, 3 minAvailable.

I think the root cause is that DAGScheduling and GangScheduling (volcano) conflict in MPIJob: the volcano PodGroup's minAvailable (3) counts the launcher, but DAG scheduling holds the launcher back until the workers are running, so only the 2 worker pods exist, the gang is never satisfied, and the workers stay Pending.

I can work around this problem by adding these args to the kubedl deployment:

- --feature-gates
- DAGScheduling=false
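
For reference, a sketch of where these args sit in the controller Deployment; the deployment name, namespace, and container name below are assumptions and may differ in your install.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubedl-controller-manager    # assumed name
  namespace: kubedl-system           # assumed namespace
spec:
  template:
    spec:
      containers:
        - name: manager              # assumed container name
          args:
            - --feature-gates
            - DAGScheduling=false    # disables DAG scheduling as a workaround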

What you expected to happen:

No pods stuck in Pending; the launcher pod should be created and the job should run.

How to reproduce it:
Enable both DAGScheduling and GangScheduling (volcano), then run an MPIJob.
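
For reference, a minimal sketch of such an MPIJob; the API group/version is assumed to be KubeDL's training.kubedl.io/v1alpha1, the image is a placeholder, and gang scheduling is assumed to be enabled on the controller side for volcano.

apiVersion: training.kubedl.io/v1alpha1   # assumed API group/version
kind: MPIJob
metadata:
  name: mpi-demo
spec:
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: mpi
              image: registry.example.com/mpi-demo:latest   # placeholder image
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: mpi
              image: registry.example.com/mpi-demo:latest   # placeholder image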

Anything else we need to know?:

Environment:

  • KubeDL version:
  • Kubernetes version (use kubectl version):
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

@HeGaoYuan Thanks GaoYuan. In the current implementation, DAGScheduling does conflict with GangScheduling; you can disable DAGScheduling temporarily by setting the startup flag: --feature-gates=DAGScheduling=false

Gang scheduling should be split up when DAG is enabled. For example, for a TFJob with 1 PS and 10 workers there should be 2 PodGroups, grouped by replica type. Please assign an issue to me and I'll implement it later.
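
A rough sketch of what that split could look like for the TFJob example, using the volcano PodGroup API; the names and grouping below are illustrative, not the actual implementation.

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: tf-demo-ps          # assumed naming scheme
spec:
  minMember: 1              # the single PS pod
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: tf-demo-worker      # assumed naming scheme
spec:
  minMember: 10             # all 10 worker pods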

In my understanding, the reason we use GangScheduling is to avoid deadlock when multiple job instances request resources. Splitting into 2 PodGroups may still lead to the deadlock problem. Maybe we should think of a better way to solve this?

In this case, the launcher pod is only a single pod and requires few resources. To keep it simple, we can make all the workers a gang and exclude the launcher pod.

This won't solve every case, but it is simple and worth trying if it works for most cases in practice. What do you think?
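
A sketch of that idea for the mpi-demo job above, assuming volcano's PodGroup and the scheduling.k8s.io/group-name pod annotation; the launcher pod simply carries no group annotation, so it is scheduled outside the gang.

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: mpi-demo-worker     # assumed naming scheme
spec:
  minMember: 2              # only the two worker replicas

Worker pod template fragment (assumed) that joins the gang:

metadata:
  annotations:
    scheduling.k8s.io/group-name: mpi-demo-worker
spec:
  schedulerName: volcano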

In my opinion, because we have the launcherRunsWorkload ability, the launcher pod may also request a lot of resources.

OK. In that case, we could disable DAG scheduling when launcherRunsWorkload is enabled. @HeGaoYuan @SimonCqk opinions?

In my opinion, we may need to refactor the DAG scheduling. Similar to the implementation of mpijob's initContainers, we could implement DAG scheduling by injecting an initContainer that waits for the dependency pod to be running.
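
For concreteness, a sketch of such an injected init container, waiting on a hypothetical dependency pod with kubectl wait; the image, pod name, and required RBAC are assumptions, not KubeDL's actual implementation.

initContainers:
  - name: wait-for-dependency
    image: bitnami/kubectl:latest    # assumed helper image; needs RBAC to read pods
    command:
      - kubectl
      - wait
      - --for=condition=Ready
      - pod/mpi-demo-worker-0        # hypothetical dependency pod
      - --timeout=600s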

Then two problems arise:

  • Thousands of init containers would be created, each watching or polling the apiserver, which causes heavy overhead on the control plane.
  • Downstream pods may be scheduled too early (not only does upstream Running trigger downstream, upstream Succeeded does too), consuming resources while they keep stalling.

Is launcherRunsWorkload a valid use case? Why is it needed? @HeGaoYuan

launcherRunsWorkload was not introduced by me, but I do use this ability in practice. Its only advantage is that it reduces the number of pods.

Speaking of launcherRunsWorkload: it is currently a global variable of kubedl. I suggest changing it to a per-job field. What do you think? @jian-he @SimonCqk

Good consideration on those two anomalies!

Good point. Actually, the global flag launcherRunsWorkload can be removed: mpiReplicaSpecs with the Launcher role already indicates that the mpijob will be driven by the launcher pod, which is the launcherRunsWorkload semantics.
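
In other words, a sketch of the per-job semantics described above (field names assumed from the kubeflow-style MPIJob API): the presence of a Launcher entry is itself the switch, so the global flag becomes unnecessary.

spec:
  mpiReplicaSpecs:
    Launcher:          # present => the job is driven by the launcher pod
      replicas: 1      # (the launcherRunsWorkload semantics described above)
      # pod template omitted
    Worker:
      replicas: 2
      # pod template omitted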

@HeGaoYuan I posted an issue and will refactor it soon: #194

@HeGaoYuan Hi, I've posted a pull request to fix it, and a new image tagged daily will be pushed to Docker Hub. If you want to try it as soon as possible, please pull the latest commits on the master branch :)

Great! I will try it.