couler-proj/couler

Need additional clarification/examples around using set_dependencies+map

varunm22 opened this issue · 2 comments

Summary

I'm confused about how to use dependencies properly. Say I have a workflow with 4 groups of steps (A, B, C, D), each with multiple subtasks that can run in parallel (A1, A2, ..., B1, B2, ...). Currently, I add all the A steps using couler.map, then all the B steps with couler.map, and so on. This correctly parallelizes across A1, A2, ..., but none of the B steps start until all the A steps have completed, even though I never explicitly set dependencies.

In this case, I want A and B to run in parallel, then C, then D. Running sequentially as A, B, C, D is technically correct, but the performance isn't ideal. However, since the steps already run sequentially without my setting any dependencies, I suspect the set_dependencies function wouldn't help. Also, when I tried set_dependencies, Couler failed to parse its own generated YAML because of duplicate anchor definitions. I'd like to see a more in-depth example than those currently in the README, showing how to use set_dependencies properly in combination with functions like map.

Use Cases

Mostly explained above.



I think I can phrase this question more concisely.

We want a job that looks like this:



       start
      /     \
     /       \
   /|\       /|\
  / | \     / | \
 A1...AN   B1...BN
  \ | /     \ | /
     \        /
      \      /
       \    /
         C
         |
         D

where A and B are separate sets of commands, each wrapped in couler.map().

What we get is this:

  start
    |
   /|\
  / | \
 A1...AN
  \ | /
    |
  / | \
 B1...BN
  \ | /
    |
    C
    |
    D

Is there a way to get this to work in Couler? Do we have to use a DAG?
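
To pin down the difference between the two pictures, here is a minimal sketch in plain Python. It does not call Couler at all: the function names (`desired_dependencies`, `sequential_dependencies`, `execution_waves`) are made up for illustration, and it only uses the standard-library `graphlib` module (Python 3.9+). Each graph maps a step to the set of steps it has to wait for, and the "waves" are the groups of steps that could run at the same time.

```python
from graphlib import TopologicalSorter


def desired_dependencies(n):
    """First diagram: A1..An and B1..Bn all start at once, then C, then D."""
    a_steps = [f"A{i}" for i in range(1, n + 1)]
    b_steps = [f"B{i}" for i in range(1, n + 1)]
    deps = {step: set() for step in a_steps + b_steps}  # no prerequisites
    deps["C"] = set(a_steps + b_steps)  # C waits for every A and B step
    deps["D"] = {"C"}
    return deps


def sequential_dependencies(n):
    """Second diagram: every B step waits for every A step."""
    a_steps = [f"A{i}" for i in range(1, n + 1)]
    b_steps = [f"B{i}" for i in range(1, n + 1)]
    deps = {step: set() for step in a_steps}
    for step in b_steps:
        deps[step] = set(a_steps)  # the implicit dependency we want to avoid
    deps["C"] = set(b_steps)
    deps["D"] = {"C"}
    return deps


def execution_waves(deps):
    """Group steps into waves; steps in one wave have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave finished before continuing
    return waves


if __name__ == "__main__":
    print(execution_waves(desired_dependencies(2)))
    print(execution_waves(sequential_dependencies(2)))
```

With two subtasks per group, the desired graph finishes in three waves and the one we actually get takes four. The extra wave comes from the edges from every A step to every B step, so those are the dependencies we'd like Couler not to add.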