Need additional clarification/examples around using set_dependencies+map
varunm22 opened this issue · 2 comments
Summary
I'm confused about how to use dependencies properly. Say I have a workflow with four groups of steps (A, B, C, D), each with multiple subtasks that can run in parallel (A1, A2, ..., B1, B2, ...). Currently, I add all the A steps using couler.map, then all the B steps with couler.map, and so on. This correctly parallelizes across A1, A2, ..., but none of the B steps start until all the A steps have completed, even though I never explicitly set dependencies.
In this case, I want A and B to run in parallel, followed by C and then D. Running sequentially as A, B, C, D is technically correct but slower than necessary. However, since the steps already run sequentially without my setting any dependencies, I suspect that set_dependencies wouldn't help. Also, when I tried to use set_dependencies, couler failed while parsing its own generated YAML because of duplicate anchor definitions. I would like to see a more in-depth example than those currently in the README, one that shows how to properly use set_dependencies in combination with functions like map.
Use Cases
Mostly explained above.
I think I can phrase this question more concisely.
We want a job that looks like this:
          start
          /   \
         /     \
      /|\       /|\
     / | \     / | \
    A1...AN   B1...BN
     \ | /     \ | /
         \     /
          \   /
           \ /
            C
            |
            D
Where A and B are separate sets of commands, each wrapped in couler.map().
What we get is this:
     start
       |
      /|\
     / | \
    A1...AN
     \ | /
       |
     / | \
    B1...BN
     \ | /
       |
       C
       |
       D
Is there a way to get this to work in Couler? Do we have to use a DAG?
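To make the difference between the two shapes concrete, here is a minimal sketch in plain Python. It does not use Couler at all and is not a fix: it only models the two dependency graphs with the standard-library graphlib module (Python 3.9+) and groups the steps into "waves" that could run in parallel. The helper execution_waves, the fan-out N = 2, and the step names are all hypothetical, chosen just for illustration, and the start node is left out.

```python
from graphlib import TopologicalSorter


def execution_waves(deps):
    """Group steps into waves; all steps in one wave could run in parallel.

    `deps` maps each step name to the set of steps it must wait for.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose predecessors are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


N = 2  # hypothetical fan-out per group, for illustration only
a_steps = [f"A{i}" for i in range(1, N + 1)]
b_steps = [f"B{i}" for i in range(1, N + 1)]

# Desired shape: A* and B* wait for nothing, C waits for all of them, D waits for C.
desired = {step: set() for step in a_steps + b_steps}
desired["C"] = set(a_steps + b_steps)
desired["D"] = {"C"}

# Observed shape: every B step implicitly waits for every A step.
observed = {step: set() for step in a_steps}
observed.update({step: set(a_steps) for step in b_steps})
observed["C"] = set(b_steps)
observed["D"] = {"C"}

print(execution_waves(desired))
print(execution_waves(observed))
```

With N = 2, the desired graph yields three waves (A1, A2, B1, B2 together, then C, then D), while the observed graph yields four (A1, A2, then B1, B2, then C, then D). The extra wave is the serialization described above.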