Pipelines / DAG
arthur-flam opened this issue · 1 comment
arthur-flam commented
Currently QA-Board lacks expressiveness for our common use-case of:
- Run on some images
- Calibration
- Validation
Likewise, we can't easily express pipelines like training followed by evaluation.
We need a way to run series of steps / pipelines / tasks organized as a directed acyclic graph (DAG).
We're looking for feedback or alternative ideas, especially from people with experience with various flow engines, e.g. DVC. Thanks!
Workarounds
Users have done this:
- wrapped `qa batch` with a scripted pipeline
- wrote complicated `run()` functions with lots of logic
Status
- Implement user-side support for sequential pipelines
- Support pipelines officially in QA-Board
- Support DAGs
Possible API
```yaml
batch1:
  inputs:
  - A.jpg
  - B.jpg
  configurations:
  - base

batch2:
  needs: batch1
  type: script
  configurations:
  - python my_script.py {o.output_dir for o in needs["batch1"]}
```
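To make the `needs` semantics concrete, here is a minimal sketch of how a runner could order batches so each runs after its dependencies. This is not QA-Board code; `batch_order` and the dict shapes are assumptions based on the proposal above.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def batch_order(batches):
    """Return batch names ordered so every batch runs after its `needs`.

    `batches` maps a batch name to its definition dict; `needs` may be a
    string, a list, or a dict of alias -> batch name (all shapes appear
    in the proposals in this issue).
    """
    graph = {}
    for name, definition in batches.items():
        needs = definition.get("needs", [])
        if isinstance(needs, str):
            needs = [needs]
        elif isinstance(needs, dict):
            needs = list(needs.values())
        graph[name] = set(needs)
    # TopologicalSorter raises CycleError if the graph is not a DAG
    return list(TopologicalSorter(graph).static_order())

# Hypothetical batches mirroring the example above
batches = {
    "batch1": {"inputs": ["A.jpg", "B.jpg"], "configurations": ["base"]},
    "batch2": {"needs": "batch1", "type": "script"},
}
print(batch_order(batches))  # ['batch1', 'batch2']
```

`graphlib` also rejects cycles for free, which gives the "acyclic" guarantee without extra validation code.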
More complex:
```yaml
my-calibration-images:
  configurations:
  - base
  inputs:
  - DL50.raw
  - DL55.raw
  - DL65.raw
  - DL75.raw

my-calibration:
  needs:
    calibration_images: my-calibration-images
  type: script
  configurations:
  - python calibration.py ${o.output_directory for o in depends[calibration_images]}

my-evaluation-batch:
  needs:
    calibration: my-calibration
  inputs:
  - test_image_1.raw
  - test_image_2.raw
  - test_image_3.raw
  configurations:
  - base
  - ${depends[calibration].output_directory}/calibration.cde
```
```console
$ qa batch my-evaluation-batch
#=> qa batch my-calibration-images
#=> qa batch my-calibration
#=> qa batch my-evaluation-batch
```
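The `${depends[calibration].output_directory}` placeholder above implies some template expansion when a batch's configurations are resolved. A minimal sketch, handling only that one placeholder shape (the `Run` class and `expand_config` are made-up names, not QA-Board API):

```python
import re

class Run:
    """Hypothetical stand-in for a finished batch run."""
    def __init__(self, output_directory):
        self.output_directory = output_directory

def expand_config(template, depends):
    """Substitute ${depends[alias].output_directory} placeholders.

    `depends` maps an alias (from `needs`) to a finished run. The full
    proposal also allows comprehensions over a whole batch, which this
    sketch does not handle.
    """
    def replace(match):
        alias = match.group(1)
        return depends[alias].output_directory
    return re.sub(r"\$\{depends\[(\w+)\]\.output_directory\}", replace, template)

config = "${depends[calibration].output_directory}/calibration.cde"
print(expand_config(config, {"calibration": Run("/outputs/my-calibration")}))
# /outputs/my-calibration/calibration.cde
```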
Thoughts
- We should add built-in support for a `script` input type that just executes its configurations as commands. It goes well with DAGs:
```yaml
my-script:
  needs: batch1
  type: script
  configurations:
  - echo OK
```
Expected
- Easy API
- Cache friendly
- Can be used in a non-blocking way
arthur-flam commented
Update: thanks to Itamar Persi and Ela Shahar, there is a pipeline implementation in "user-land":
```yaml
my-pipeline:
  configs:
  - run: echo "Step 1"
  - batch: first-batch
  - batch:
    - second-batch
    - third-batch
  - label: batches running in parallel
  - run: some-postprocessing-script.py
```
Features include:
- using `PIPELINE_OUTPUT_DIR` to save data across the batches
- providing `run` steps with info on the previous batch (what `qa batch --list` returns)
It's much simpler than a full DAG, and good enough in most cases.
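The dispatch logic of such a user-land pipeline can be sketched as follows; `run_pipeline` and the callback names are assumptions, and the real code would shell out to `qa batch` instead of taking callbacks:

```python
def run_pipeline(steps, run_batch, run_command):
    """Interpret the pipeline config above: each step is a dict holding
    either a `run` shell command or a `batch` name (or list of names).

    `run_batch` / `run_command` are callbacks so the dispatch logic can
    be shown (and tested) without invoking the real CLI.
    """
    for step in steps:
        if "run" in step:
            run_command(step["run"])
        if "batch" in step:
            batches = step["batch"]
            if isinstance(batches, str):
                batches = [batches]
            for batch in batches:  # these could be launched in parallel
                run_batch(batch)

log = []
run_pipeline(
    [
        {"run": 'echo "Step 1"'},
        {"batch": "first-batch"},
        {"batch": ["second-batch", "third-batch"]},
    ],
    run_batch=lambda b: log.append(("batch", b)),
    run_command=lambda c: log.append(("run", c)),
)
print(log)
```

Sequential interpretation of a step list like this is exactly why it stays "much simpler than a full DAG": ordering is the list order, so no dependency resolution is needed.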
Next steps
- We'll contribute it to the project as a default `run()` when the input type is `pipeline`
- Until then, we'll provide the code here on request as sample code (just comment here)...