kubedl-io/kubedl

[BUG] PyTorch distributed training task is unschedulable when using volcano gang scheduling


What happened:
I cannot submit a PyTorch training task successfully when using volcano as the scheduler. After discussing with @shinytang6 and @SimonCqk, they found that kubedl's DAG scheduling conflicts with volcano gang scheduling.
For more detailed info, please see volcano-sh/volcano#1959
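
For anyone unfamiliar with the conflict, a rough illustration: kubedl's DAG scheduling creates the master pod first and only fans out the workers once the master is running, while volcano gang scheduling refuses to bind any pod until the whole gang (minMember pods) has been created. The Go sketch below uses hypothetical, simplified types (not kubedl's or volcano's actual code) just to show why these two behaviors deadlock each other:

```go
package main

import "fmt"

// Hypothetical, simplified model of the two mechanisms; not kubedl's actual code.
type job struct {
	totalReplicas int  // gang minMember is typically set to this
	createdPods   int  // pods the controller has actually created so far
	masterRunning bool // whether the master pod has started
}

// dagCreatePods mimics DAG scheduling: workers are only created
// after the master pod is up and running.
func dagCreatePods(j *job) {
	if j.createdPods == 0 {
		j.createdPods = 1 // create the master first
		return
	}
	if j.masterRunning {
		j.createdPods = j.totalReplicas // then fan out the workers
	}
}

// gangSchedulable mimics volcano gang scheduling: nothing is bound
// until the whole gang (minMember pods) exists.
func gangSchedulable(j *job) bool {
	return j.createdPods >= j.totalReplicas
}

func main() {
	j := &job{totalReplicas: 3}
	dagCreatePods(j) // controller creates only the master
	// volcano refuses to bind the master because the gang is incomplete...
	fmt.Println("gang schedulable:", gangSchedulable(j))
	// ...so the master never runs, and DAG never creates the workers: deadlock.
	dagCreatePods(j)
	fmt.Println("pods created:", j.createdPods, "of", j.totalReplicas)
}
```

Running this prints that the gang never becomes schedulable and only 1 of 3 pods is ever created, which matches the stuck state described in volcano-sh/volcano#1959.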

What you expected to happen:
The PyTorch training job mentioned above should be schedulable when using the volcano gang scheduler.

/cc

@CaRRotOne @kerthcet Hi there, as we discussed in volcano-sh/volcano#1959, I'm working on a fix for this issue.

Please let me know if there's any progress. Thanks a lot.

@kerthcet @CaRRotOne Hi, I have posted a new pull request to track and fix this issue.

@CaRRotOne @kerthcet I'll build the latest image tagged `daily` and push it to Docker Hub. If you want to try it as soon as possible, please pull the latest commits from the master branch :)