Samsung/qaboard

Pipelines / DAG


Currently QA-Board lacks expressiveness for our common use-case of:

  1. Run on some images
  2. Calibration
  3. Validation
Likewise, we can't easily express pipelines like training-then-evaluation.

We need a way to express series of steps / pipelines / tasks organized as a directed acyclic graph (DAG).

We're looking for feedback or alternative ideas, especially if you have experience with workflow engines such as DVC. Thanks!

Workarounds

Users have worked around this by:

  • wrapping qa batch calls with a scripted pipeline
  • writing a complicated run() function with lots of logic
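The first workaround can be sketched roughly as follows. This is a minimal Python sketch, not code from an actual project: the step names and the subprocess wrapper are illustrative.

```python
# Sketch of the "wrap qa batch in a script" workaround: run each step's
# `qa batch` command in order, aborting on the first failure.
import subprocess

# Illustrative step names, not from a real project.
STEPS = ["my-calibration-images", "my-calibration", "my-evaluation-batch"]

def step_command(batch_name):
    """Build the `qa batch` invocation for one step."""
    return ["qa", "batch", batch_name]

def run_pipeline(steps, runner=subprocess.run):
    """Run steps sequentially; stop at the first failure."""
    for name in steps:
        result = runner(step_command(name))
        if result.returncode != 0:
            raise RuntimeError(f"step {name!r} failed")
```

The `runner` parameter is only there to make the sketch testable without actually invoking `qa`.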

Status

  • Implement user-side support for sequential pipelines
  • Support pipelines officially in QA-Board
  • Support DAGs

Possible API

batch1:
  inputs:
  - A.jpg
  - B.jpg
  configurations:
  - base

batch2:
  needs: batch1
  type: script
  configurations:
  - python my_script.py ${o.output_dir for o in needs["batch1"]}

More complex:

my-calibration-images:
    configurations:
    - base
    inputs:
    - DL50.raw
    - DL55.raw
    - DL65.raw
    - DL75.raw

my-calibration:
    needs:
      calibration_images: my-calibration-images
    type: script
    configurations:
    - python calibration.py ${o.output_dir for o in needs[calibration_images]}

my-evaluation-batch:
    needs:
      calibration: my-calibration
    inputs:
    - test_image_1.raw
    - test_image_2.raw
    - test_image_3.raw
    configurations:
    - base
    - ${needs[calibration].output_dir}/calibration.cde
$ qa batch my-evaluation-batch
#=> qa batch my-calibration-images
#=> qa batch my-calibration
#=> qa batch my-evaluation-batch
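Under this hypothetical API, the expansion shown above amounts to a depth-first walk of the needs edges, running dependencies before dependents (a topological order of the DAG). A minimal sketch, assuming the YAML is already parsed into a dict — none of these names are real QA-Board API:

```python
# The `batches` dict mirrors the hypothetical YAML above.
batches = {
    "my-calibration-images": {"needs": {}},
    "my-calibration": {"needs": {"calibration_images": "my-calibration-images"}},
    "my-evaluation-batch": {"needs": {"calibration": "my-calibration"}},
}

def execution_order(target, batches, seen=None):
    """Return the batches to run, dependencies first, each at most once."""
    if seen is None:
        seen = set()
    order = []
    for dep in batches[target]["needs"].values():
        order += execution_order(dep, batches, seen)
    if target not in seen:
        seen.add(target)
        order.append(target)
    return order
```

The `seen` set makes diamond dependencies run only once, which is also what makes this cache-friendly.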

Thoughts

  • We should add built-in support for a script input type that simply executes its configurations as shell commands. It goes well with DAGs.
my-script:
  needs: batch1
  type: script
  configurations:
  - echo OK

Expected

  • Easy API
  • Cache friendly
  • Can be used in a non-blocking way

Update: thanks to Itamar Persi and Ela Shahar, there is a pipeline implementation in "user-land":

my-pipeline:
  configs:
  - run: echo "Step 1"
  - batch: first-batch
  - batch:
    - second-batch
    - third-batch
    - label: batches running in parallel
  - run: some-postprocessing-script.py
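The user-land code itself isn't published in this issue, but a run() interpreting configs like these would roughly dispatch on the run/batch keys. A hedged sketch, with callback names invented for illustration:

```python
# NOT the actual user-land implementation: a rough dispatcher for pipeline
# configs where `run:` entries are shell commands and `batch:` entries name
# one or more batches to launch together.
def interpret(configs, run_step, batch_step):
    """Dispatch each pipeline step, in order, to the given callbacks."""
    for step in configs:
        if "run" in step:
            run_step(step["run"])
        elif "batch" in step:
            batches = step["batch"]
            if isinstance(batches, str):
                batches = [batches]
            # skip metadata entries such as `label:` inside a batch list
            batch_step([b for b in batches if isinstance(b, str)])
```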

Features include

  • using PIPELINE_OUTPUT_DIR to share data across the pipeline's steps
  • providing run steps with info on the previous batch (what qa batch --list returns)
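For example, a run: step could use PIPELINE_OUTPUT_DIR, the one environment variable the issue names, to leave results where later steps can read them. The helper name and JSON file layout below are assumptions for illustration, not part of QA-Board:

```python
# Illustrative only: write results into the pipeline's shared output
# directory so a later step (e.g. a postprocessing script) can pick them up.
import json
import os
from pathlib import Path

def save_results(results, filename="results.json"):
    """Serialize `results` as JSON under PIPELINE_OUTPUT_DIR."""
    out_dir = Path(os.environ.get("PIPELINE_OUTPUT_DIR", "."))
    path = out_dir / filename
    path.write_text(json.dumps(results))
    return path
```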

It's much simpler than a full DAG, and good enough in most cases.

Next steps

  • We'll contribute it to the project as a default run() if the input type is pipeline
  • Until then, we'll provide the code here on request as sample code (just comment here)...