iterative/dvc

Stages with conditional dependency

EvanKomp opened this issue · 4 comments

Correct me if this already exists. I see some merges from 2018 that may be related (#646), but no examples.

Essentially, I have a stage that prepares a model, and I would like to choose among several model options via a parameter. Each model has a potentially unique preprocessing step, but some models share an additional preprocessing step.

For example, the param `model` modulates the stage `predict`, which for some models requires no previous stage but for others requires a stage `preprocess`. How can I ensure that `preprocess` runs for the models that require it without rerunning it unnecessarily, since it is expensive? If I also condition the `preprocess` step on the param `model`, it will rerun even when I switch between models for which it does not need to be rerun.

Thanks for any wisdom.

Could you provide a simplified dvc.yaml to clarify how your pipeline is set up?

@dberenbaum

stages:
  preprocess:
    cmd: ./prepare.sh
    outs:
      - ./data/preprocessing/
  predict:
    cmd: ./predict.sh
    params:
      - model_type       # One of A, B, C
    deps:
      - ./data/preprocessing/       # THIS ONLY NEEDS TO BE A DEPENDENCY WHEN `model_type` IS IN [A, B]
    outs:
      - ./data/predictions/

Unfortunately, I can't think of a good way to do it without creating separate stages/pipelines. If you have some idea of what you would want it to look like, feel free to suggest it here.
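
For reference, the separate-stages workaround mentioned above might look roughly like the sketch below. This is only an illustration under assumptions: the stage names `predict_preprocessed` and `predict_raw` and the two output directories are hypothetical and not from the original pipeline. The outputs are kept in separate directories because, as far as I know, DVC does not allow two stages to declare the same output path.

```yaml
stages:
  preprocess:
    cmd: ./prepare.sh
    outs:
      - ./data/preprocessing/
  # Hypothetical stage for models that need preprocessing (A, B)
  predict_preprocessed:
    cmd: ./predict.sh
    params:
      - model_type
    deps:
      - ./data/preprocessing/
    outs:
      - ./data/predictions_preprocessed/   # assumed path
  # Hypothetical stage for models that do not need preprocessing (C)
  predict_raw:
    cmd: ./predict.sh
    params:
      - model_type
    outs:
      - ./data/predictions_raw/            # assumed path
```

With a layout like this, `dvc repro predict_raw` targets only the stage that has no dependency on `preprocess`, so the expensive step is not part of what gets reproduced, while `dvc repro predict_preprocessed` pulls in `preprocess` as an upstream stage. The cost is that the choice of stage has to be made by hand rather than derived from `model_type`.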

Affirmative. Thanks for your work. I think extending the YAML, as you would with a cache tag, would be best, e.g.:

stages:
  preprocess:
    cmd: ./prepare.sh
    outs:
      - ./data/preprocessing/
  predict:
    cmd: ./predict.sh
    params:
      - model.model_type       # One of A, B, C
    deps:
      - ./data/preprocessing/:
          # conditioning syntax
          conditions: # these are executable strings with params as local namespace
            - 'model.model_type in ["A", "B"]'
    outs:
      - ./data/predictions/