[BUG] PyTorch distributed training task is unschedulable when using Volcano gang scheduling
Closed this issue · 5 comments
What happened:
I cannot submit a PyTorch training job successfully when using Volcano as the scheduler. After discussing with @shinytang6 and @SimonCqk, we found that KubeDL's DAG scheduling conflicts with Volcano's gang scheduling.
For more detail, please see volcano-sh/volcano#1959
What you expected to happen:
The PyTorch training job mentioned above should be schedulable when using the Volcano gang scheduler.
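For reference, a minimal sketch of the kind of job spec involved — a PyTorchJob that asks for Volcano as its scheduler. This is illustrative only: the resource names, API group/version, and image here are assumptions and may not match the exact manifest from volcano-sh/volcano#1959.

```yaml
# Hypothetical minimal PyTorchJob using Volcano for gang scheduling.
# apiVersion/kind and image are illustrative assumptions, not the exact
# manifest from the linked issue.
apiVersion: "training.kubedl.io/v1alpha1"
kind: PyTorchJob
metadata:
  name: pytorch-dist-example
spec:
  schedulingPolicy:
    # Request gang scheduling via Volcano instead of the default scheduler.
    schedulerName: volcano
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:latest  # placeholder image
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:latest  # placeholder image
```

With a spec like this, all Master and Worker pods should be scheduled together as one gang rather than blocked by KubeDL's DAG ordering.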
/cc
@CaRRotOne @kerthcet Hi there, as we discussed in volcano-sh/volcano#1959, I'm working on a fix for this issue.
Please let me know if there's any progress. Thanks a lot.
@kerthcet @CaRRotOne hi, I have opened a new pull request to track & fix this issue.
@CaRRotOne @kerthcet I'll build a new image tagged `daily`
and push it to Docker Hub. If you want to try it as soon as possible, please pull the latest commits from the master branch :)