vatlab/sos

The stop/fail/warn_if series of actions

BoPeng opened this issue · 46 comments

The skip option has a long history dating back to VPT and is used to skip a section:

[10: skip=True]

It was supposed to be used for

  1. Quickly comment out a step, although renaming the step would have the same effect.
  2. Enable certain steps depending on the complexity of the workflow (e.g. sos run --quality-control true), but we already have many ways to do the same thing (e.g. a subworkflow, see the sketch below).

I have never used it for any production code and do not see a good usage for it.
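
For the record, the second use case can already be handled with a parameter and a nested workflow. An untested sketch (names made up):

[global]
parameter: quality_control = False

[qc]
stop_if(not quality_control)
print('running optional quality control')

[default]
sos_run('qc')
print('main analysis')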

gaow commented

Hmm, I've always been using stop_if, but I can see skip_if being an appealing feature. E.g. I do use:

stop_if(some_file.exists(), msg = '...')

so maybe we want to make it skip_if and deprecate that section option?

Currently we do not have skip_if, but we do have stop_if, which essentially serves as skip_if because it does not raise an error. There is even a test case for

[10]
output: 'a.txt'
_output.touch()
stop_if(True)

for a.txt to be kept when the substep is stopped.

I have not tested but I think

[10: skip=True]

would be the same as

[10]
stop_if(True)

gaow commented

well,

[1]
stop_if(True)
[2]
print(999)

gives:

999

so stop_if is indeed skip_if. I was under the impression that stop_if would quit the execution without printing 999 -- then this might warrant a feature request for a real stop_if: either some parameters, or, for clarity, making stop_if and skip_if separate?

There is a fail_if.
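
For contrast, an untested sketch of the difference: fail_if aborts with an error, so 999 is never printed.

[1]
fail_if(True)

[2]
print(999)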

gaow commented

There is a fail_if.

I understand. But fail_if will result in return code 1, right? What if we want a peaceful, complete stop?

I think conceptually there can be a skip_if, that is to say,

  • skip_if is the safe skip, with no consequence (this is the current stop_if).
  • stop_if to stop the current substep, with the _output removed due to failure of the substep.
  • fail_if to stop the entire workflow.

Then that thing can be called done_if. 😄

With so many xxx_if actions, the active option can then be deprecated (#1064). As I said, the problem with active is that we do not have control over whether the consequence is skip or stop.
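
Roughly what I mean, as an untested sketch (the active syntax is from memory, and skip_if is of course hypothetical at this point):

# today: limit an action to the first two substeps with active;
# whether the remaining substeps count as skipped or stopped is implicit
[current_10]
input: for_each={'i': range(4)}
sh: active=[0, 1], expand=True
  echo "processing {i}"

# proposed: state the consequence explicitly
[proposed_10]
input: for_each={'i': range(4)}
skip_if(_index > 1)
sh: expand=True
  echo "processing {i}"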

gaow commented

stop_if to stop the current substep, with the _output removed due to failure of the substep.

I'm not sure why you want this. If the step fails, it should just quit on error. fail_if should stop the entire workflow with the output from the current step removed.

So yes we can have skip_if, stop_if and fail_if; stop_if is done_if.

RE #1064: I think an attempt was made along the lines of not using active in #1108, but it did not work. Maybe we can revisit that example after we have implemented skip_if? Things might have worked now.

Removing _output is now the standard procedure for atomic write. So

  • skip_if will be like "break" of a loop. However, _output needs to be generated before it; otherwise the step will fail because _output does not exist after the completion of the step.

  • stop_if currently stops the substep, but not the entire step. It also does not remove existing _output. I am not sure what behavior you would like to have.

  • fail_if will terminate the execution of the workflow. Not sure if it kills running steps.

  • done_if will mark the end of the workflow. Not sure if it kills running steps.

So basically there are several complications here:

  1. Whether other substeps will be affected.
  2. Whether _output will be removed.
  3. Whether other running steps should be killed.

gaow commented

skip_if will be like "break" of a loop.

well, rather like the continue of a loop? I'm saying the current stop_if should be renamed to skip_if. Its corresponding _output should be removed from step_output. Now I agree with what you said previously, that "the _output removed due to failure of the substep" -- but it is not a failure, it is what we asked for.

So without worrying about backwards compatibility, conceptually:

  1. skip_if to replace the current stop_if, which is continue for a loop, but with the output properly adjusted
  2. stop_if to behave like done_if, to peacefully end a workflow. It does not kill running steps.
  3. fail_if to quit with an error, break the substep loop if applicable, and kill running steps

Now we lack a break-loop behavior, which I'm not sure is necessary ... trying to think of a case where we might need it.

gaow commented

And we are evaluating this with #1064 in mind ... so we might want to be thorough here and resolve both tickets.

I do not have a good overall solution now; perhaps in the end we will need something like

stop_if(condition, remove_output, end=substep/step/workflow)

Also, skip means skipping at least the substep (removing what might have been done until now), so _output should be removed silently. The continue operation sounds more like end_substep_if...

In any case let us deprecate the step option skip for now.

We will need to list all use cases and their pros and cons to decide what to do with this. For example, stop_if does not work well for

input: a large number of groups
stop_if(_index > 10)

because all substeps will be generated and then stopped. In this particular case we are looking for something like break, not stop_if, which is essentially continue.
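
To make the concern concrete, a toy version (untested):

[10]
input: for_each={'i': range(100000)}
# every one of the 100000 substeps is still created and evaluated here,
# only to be stopped right away; a break-like action could avoid
# submitting them at all
stop_if(_index > 10)
print(f'processing group {i}')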

gaow commented

I still think my comment here plus the possible need for a break behavior (also in that comment) are good enough. Of course we might want to keep the behavior of stop_if for backwards compatibility.

BTW even fail_if now does not break the loop. I think it should.

gaow commented

Just to push a bit more on the progress of this discussion: see #1159 (comment) for a motivating example of needing something different (or something that behaves differently).

Thinking of applications, I think the following can be useful, but we need to make sure that the names are not confusing:

name | usage | remove _output | stop step | stop workflow
---- | ----- | -------------- | --------- | -------------
warn_if | gives a warning | no | no | no
stop_if(keep_output=True)* | continue; skip the rest of the substep, assuming output already exists | no | no | no
stop_if | stop the substep, assuming output is invalid and should be removed if it already exists | yes | no, run other substeps as planned | no
stop_if(stop_all)* | break-like; stop sending new substeps | yes, for substeps > _index | no | no
fail_if | exception | exception, no step_output will be returned | yes | yes, kill other steps immediately*

* new or change of behavior

gaow commented

  1. When fail_if stops (the rest of) a step with an error, doesn't it automatically stop the entire workflow? Or will the parts of the DAG not depending on it still run?
  2. I think a peaceful stop of the entire workflow would also be welcome? So I think fail_if should stop the workflow and terminate_if should quit the entire workflow peacefully.
  3. Why do we ever need keep_output = True? That output is generated in previous code chunks but the rest of the code will not be executed? I do not see this happening very often ... at least generating output is usually in the last few lines of my code.

So for 3, I'm still thinking:

name | usage | remove _output | stop step | stop workflow
---- | ----- | -------------- | --------- | -------------
skip_if? | stop the substep, assuming output is invalid and should be removed if it already exists | yes | no, run other substeps as planned | no
stop_if / break_if | break-like; stop sending new substeps | yes, for substeps > _index | no | no

There will be compatibility issues at least on my end, but I can live with it and change my pipelines ... maybe for you there is a bigger problem?

For skip_if, I have sometimes used a "manual signature" check, something like

[10]
stop_if('done' in whatever_mechanism)
task:
long_job

Technically speaking, it makes a difference whether or not step 10 is explicitly killed due to the fail_if in step 20.

[10]
import time
time.sleep(20000)

[20]
input: None
fail_if(True)

But I do not know which way is better or if we should allow both. This also applies to done_if and done_if(all) (for stopping the entire workflow).

The current behavior is that sos waits for the completion of step 10 before exiting because of the failure in step 20.

gaow commented

I still think the scope of something like done_if is confined within a step and fail_if stops the entire workflow -- basically if one sees something "fail", it would be natural to stop everything and fix it up immediately, rather than having to live with it and wait for other tasks to finish before attending to the problematic runs. It is somewhat awkward otherwise.

See the above table. What about

  1. Change the behavior of fail_if to kill immediately.
  2. Keep stop_if as it is (remove _output).
  3. Add skip_if, or an option to stop_if, to disable the behavior of removing _output.
  4. Add done_if as the break behavior. It will continue to submit substeps with _index <= done_if_index, stop submitting new ones, and remove _output for substeps with _index > done_if_index.

gaow commented

I see. I mostly agree with the current proposal, plus:

  1. If we are to keep the behavior of stop_if, then I vote for adding an option to let stop_if keep the output.
  2. The problem with having done_if is that, in scenarios not involving substeps, it behaves like stop_if. So maybe we should use another stop_if option to implement this, something like:

stop_if(mode = 'break') for the done_if
stop_if(mode = 'continue') for the stop_if(keep_output = True) ?

OK, the mode parameter does not make it terribly clear what will happen, so I would prefer two parameters:

  • stop_all=True, keep_output=True: this assumes everything has been done before; we stop every substep but do not change _output.
  • stop_all=True, keep_output=False: we need to be more careful about this one because we have to do things differently for substeps before and after _index, as proposed.
  • stop_all=False, keep_output=True: assuming this substep is done, keep other substeps running.
  • (default) stop_all=False, keep_output=False: do not run this substep at all; remove _output.

gaow commented

Okay, but stop_all sounds less intuitive than break ... ? Especially when the function name already has a stop in it.

break_all or just all? We have all or -a from time to time.

This patch fixes

import time

[10]
time.sleep(2000)

[20]
input: None
time.sleep(2)
fail_if(True)

gaow commented

I think all is good.

But stop_if(all=True) does not sound like break. It appears to apply the action to all substeps.

gaow commented

Well, I preferred all over break_all, but it in fact is just break, right? stop_if(break = True); or quit_step = True, if you find break too much of a jargon?

stop_if(_index > 32)

is the same as

stop_if(_index == 32, break=True)

I complained about the former because all substeps need to be executed, but at the same time _output has a chance to be generated, which makes

stop_if(_index > 32, keep_output=True)

a lot easier to implement.

So if we do not consider performance (which could be implemented in a more clever way, perhaps), I think

stop_if(_index > 32, keep_output=True)

is better than

stop_if(_index == 32, break=True)

gaow commented
stop_if(_index > 32)

is more natural to think of than using break. But I thought we needed break for situations more complicated than using _index. For example, when, say, the remaining disk space is low, break will quit once and for all. I think _index is still a somewhat advanced feature that we do not expect people to use a lot.

My concern is that

stop_if(all=True, keep_output=True)

is difficult to implement, as we will have to evaluate all _output anyway for keep_output=True. Then the break behavior of

stop_if(all=True, keep_output=False)

is troublesome to implement in the concurrent case because substeps before it might not have been submitted and substeps after it might have been running, and the only easy way to implement it is to disable concurrency in this case.

So in the end we do not have a good solution in either case, and since we do not have a convincing use case for it, perhaps we can keep stop_if at the substep level for now.

I am not saying we need it now, but the all option can be named skip_rest, and we can disallow the case skip_rest=True, keep_output=True since it is conceptually difficult to understand (we cannot get the output of skipped substeps). Then skip_rest=True will skip all the rest of the substeps.

gaow commented

Well, the interface is now stop_if(no_output=False). I argue that we should make no_output=True or keep_output=False the default. Also, it would be nice if we could finalize the interface? The problem with skip_rest is that in practice it is difficult to determine what is meant by "rest" for a concurrent process. I still feel "break" sounds better? Also, for a concurrent process, I'm not sure skip_rest or break is well defined -- you might get random results every time you run it, depending on when the "stop_if" behavior is triggered.

So maybe let's finalize the interface by first deciding between stop_if(no_output=False) and stop_if(keep_output=True)? We can table the decision on the break behavior if its random nature bothers us.

The argument for stop_if(no_output=False) is at #1161. Basically it is bad to use an operation that removes a user-generated output as the default. skip_rest is difficult to implement, and that is why I did not implement it.

gaow commented

Basically it is bad to use an operation that removes a user-generated output as the default

I agree with this particular point. But my argument was that stop_if usually happens before output is generated anyway: at least for most of my pipelines, if not all, outputs are generated at the end of the step. So typically when stop_if happens there is nothing to remove anyway. But the problem is that with no_output=False these outputs will be evaluated at the next step, which will then complain about missing output, thus breaking the pipeline altogether.

BTW I've got a worse example that I'm trying to make an MWE for -- I think stop_if(no_output=False) is making my other concurrent substeps fail. It does not sound like it makes sense, but that's what I think is happening, and I'm trying to confirm it with an MWE.

gaow commented

I think stop_if(no_output=False) is making my other concurrent substeps fail.

Cannot reproduce it now. Please ignore.

I think now I can live with stop_if(no_output=False) by default, everything considered. But can we change no_output to remove_output? That feels like removing it both from disk and from step_output.

I considered remove_output and discard_output, but these imply that the output is already generated and should be removed. In contrast, no_output can imply both cases.

So we have

  • no_output=False, the default behavior.
    • If output is generated, good, keep it. remove_output=False is appropriate.
    • If output is not generated, error, because there is supposed to be output. remove_output=False is not quite right.
  • no_output=True.
    • If output is generated, remove it as the user asks. remove_output=True is fine.
    • If output is not generated, do nothing. remove_output=True is not quite right because output was not generated in the first place.

But all of this is subtle and tends to be confusing, which is why we could not settle on a good set of names. Perhaps it is better to separate the use cases and say something like

output:
produce_output
done_if()  # already done (output generated)

and

output:
skip_if()   # skip substep (and no output has been generated)
produce_output

gaow commented

but these imply that the output is already generated and should be removed.

they also imply output will be removed from step_output, which is why I am in favor of those names. no_output does not quite highlight this. I think it is more important to highlight the consequence from the workflow's point of view.

Perhaps it is good to separate the use cases and say something like

In my head I already consider stop_if(no_output=True) equal to skip_if. I'm not quite sure what done_if is for (I know what it does, but I am not sure it is worth separating). I think the stop_if behavior already covers at least my two use cases with no_output=True and False. I only have issues with the argument name, which does not highlight the consequence for step_output, and the fact that it breaks following steps by default in a not so peaceful way (failing with an error); hence I felt the skip behavior is more appropriate by default, but I do not have a strong opinion on it.

[1]
input: for_each = {'i': range(10)}
output: ...
stop_if(i == 4)

[2]
...

[GW] sos run test1.sos
INFO: Running 1: 
INFO: output:   0.txt 1.txt... (10 items in 10 groups)
ERROR: [1]: Output target 4.txt does not exist after the completion of step 1
[default]: 1 pending step: 2

My point was that stop does not automatically imply whether the output has been generated, and users have to think about the use of the parameter no_output or remove_output, which is a technical detail. Instead of the parameter, perhaps we can just write down the two cases

output:
produce_output ...
done_if()

here done means that what we want to achieve for the substep is already done, namely the output has already been created, and

output:
skip_if()
produce_output...

here skip means we do not want to run the substep, which naturally means no _output.

If we introduce the actions together with the complete patterns (the order of output, action, and produce_output), perhaps it will be much easier for users to understand.
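
For example, an untested sketch under the proposed done_if/skip_if semantics (file names made up):

[10]
input: for_each={'i': range(10)}
output: f'result_{i}.txt'

# "skip" pattern: decide before doing any work; no _output is expected afterwards
skip_if(i == 0)

_output.touch()   # stand-in for the real work that creates the output

# "done" pattern: the output exists, so end the substep early and keep it
done_if(i == 4)

print('optional post-processing, skipped when i == 4')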

gaow commented

Sure, I think we are on the same page. It is just that I'm already used to thinking of done_if = stop_if and skip_if = stop_if(no_output=True), so I'm indifferent towards introducing these two additional options: to me they are just aliases.

My workflow example above tries to make a case that fails weirdly; here I use done_if to show that the problem is still there:

[1]
input: for_each = {'i': range(10)}
output: ...
done_if(i == 4)
generate_output(...)

[2]
...

The point is that there is no restriction on where done_if appears, so in this case it breaks [2] in a not so clean way. Do you consider that a user error, or expected behavior?

That would be a user error because it says the step is done before the output is actually generated.

gaow commented

That would be a user error because it says the step is done before the output is actually generated.

Okay, now I agree the name done_if is intuitive in implying that this is a user error.

So to summarize:

  1. We keep stop_if as is for backwards compatibility, but we stop documenting and using it.
  2. We introduce done_if and skip_if.
  3. The behavior of fail_if and warn_if remains unchanged.

right?

Yes, that sounds better than stop_if(no_output=True/False). We have swung between different options quite a few times now; let us finalize it this way.