The stop/fail/warn_if series of actions
BoPeng opened this issue · 46 comments
The skip
option had a long history from VPT and is used to skip a section
[10: skip=True]
It was supposed to be used for
- Quickly comment on a step but renaming the step would have the same effect.
- Enable certain steps depending on the complexity of the workflow (e.g.
sos run --quality-control true
) but we already have many ways to do the same (e.g. subworkflow).
I have never used it for any production code and do not see a good usage for it.
Hmm I've always been using stop_if
but I can see skip_if
an appealing feature. eg I do use:
stop_if(some_file.exists(), msg = '...')
so maybe we want to make it skip_if
and decprecate that section option?
Currently we do not have skip_if
but we do have stop_if
, which essentially serve as skip_if
because it does not raise an error. There is even a test case for
[10]
output: 'a.txt'
_output.touch()
stop_if(true)
for a.txt
to be kept when the substep is stopped.
I have not tested but I think
[10: skip=True]
would be the same as
[10]
stop_if(True)
well,
[1]
stop_if(True)
[2]
print(999)
gives:
999
so stop_if
is indeed skip_if
. I was under the impression stop_if
will quit the execution without printing 999
-- then this might warrant a feature request for a real stop_if
-- some parameters, or for clarity, make stop_if
and skip_if
separate?
There is a fail_if
.
There is a fail_if.
I understand. But fail_if
will result in return code 1 right? What if we want a peaceful complete stop?
I think conceptually there can be a skip_if
, that is to say,
skip_if
is the safe skip, with no consequence. (this is the currentstop_if
).stop_if
to stop the current substep, with the_output
removed due to failure of the substep.fail_if
to stop the entire workflow.
Then that thing can be called done_if
. 😄
With so many xxx_if
then all the active
options can be deprecated #1064 . As I said, the problem with active
is that we do not have control if the consequence is skip
or stop
.
stop_if to stop the current substep, with the _output removed due to failure of the substep.
I'm not sure why you want this. If the step fails it should just quit on error. fail_if
should stop the entire workflow with output from current step removed.
So yes we can have skip_if
, stop_if
and fail_if
; stop_if
is done_if
.
RE #1064: I think an attempt was made along the line of not using active
in #1108 but did not work. Maybe we can revisit that example after we have implemented skip_if
? Things might have worked now.
Removing _output
is now the standard procedure for atomic write
. So
-
skip_if
will be like "break" of a loop. However, the_output
needs to be generated before it. Otherwise the step will fail with_output
does not exist after the completion of the step. -
stop_if
currently stops the substep, but not the entire step. If also does not removing existing_output
. I am not sure what behavior you would like to have. -
fail_if
will terminate the execution of the workflow. Not sure if it kills running steps. -
done_if
will mark the end of the workflow. Not sure if it kills running steps.
So basically there are several complications here:
- If other substeps will be affected.
- If
_output
will be removed. - If other running steps should be killed.
skip_if will be like "break" of a loop.
well, rather be like continue
of a loop? I'm saying the current stop_if
should be renamed to skip_if
. Its corresponding _output
should be removed from step_output
. Now I agree what you said previously that "the _output removed due to failure of the substep." -- but it is not a failure, it is what we asked for.
So without worrying about backwards compatibility, conceptually:
skip_if
to replace currentstop_if
which iscontinue
for a loop; but output properly adjustedstop_if
to behave likedone_if
to peacefully end a workflow. It does not kill running steps.fail_if
to quit with error, break loop in substep if applicatble, and kills running steps
Now we lack a break
loop behavior which I'm not sure would be necessary ... trying to think of a case we might need it.
And we are evaluating this with #1064 in mind ... so we might want to be thorough here and resolve both tickets.
I do not have a good overall solution now, perhaps in the end we will need to have something like
stop_if(condition, remove_output, end=substep/step/workflow)
Also, skip
means skip at least the substep (remove what might have been done till now), so _output
should be removed silently, The continue
operation sounds more like end_substep_if
...
In any case let us deprecate the step option skip
for now.
We will need to list all use cases and their pros and cons to decide what to do with this. For example, stop_if
does not work well for
input: a large number of groups
stop_if(_index > 10)
because all steps will be generated and then be stopped. In this particular case we are looking for something like break
, not the stop_if
, which is essentially continue
.
I still think my comment here plus the possible need for a break
behavior (also in that comment) are good enough. Of course we might want to keep the behavior of stop_if
for backwards compatibliity.
BTW even fail_if
now does not break
the loop. I think it should.
Just to push a bit more on the progress of this discussion: see #1159 (comment) a motivating example to have something different (or to behave differently)
Thinking of application, I think the following can be useful but we need to make sure that the names are not confusing:
name | usage | remove _output |
stop step | step workflow |
---|---|---|---|---|
warn_if |
gives a warning | no | no | no |
stop_if(keep_output=True) * |
continue , skip the rest of the substep, assuming output already exist |
no | no | no |
stop_if |
stop the substep, assuming output is invalid and should be removed if already exist | yes | no, run other substeps as planned | no |
stop_if(stop_all) * |
break like, stop sending new substeps |
yes for substeps > _index |
no | no |
fail_if |
exception | exception, no step_output will be returned |
yes | yes, kill other steps immediately * |
* new or change of behavior
- when
fail_if
stops (the rest of) a step with error, doesnt it automatically stop the entire workflow? Or the parts of DAG not depending on it will still work? - I think some peaceful stop of the entire workflow is also welcomed? So I think
fail_if
should stop the workflow andterminate_if
quits the entire workflow peacefully. - why do we ever need
keep_output = True
? that output is generated in previous code chunks but the rest of the code will not be executed? I do not see this happen very often ... at least ggenerating output is usually the last few lines of my code. ...
So for 3 I'm still tshinking:
name | usage | remove _output |
stop step | step workflow |
---|---|---|---|---|
skip_if ? |
stop the substep, assuming output is invalid and should be removed if already exist | yes | no, run other substeps as planned | no |
stop_if / break_if |
break like, stop sending new substeps |
yes for substeps > _index |
no | no |
there will be compatibility issues at least on my end, but i can live with it and change my pipelines ... maybe for you there is a bigger problem?
For skip_if
, I had sometimes "manual signature" check, something like
[10]
stop_if('done' in whatever_mechanism)
task:
long_job
Technically speaking, there is a difference in whether or not 10
should be explicitly killed due to fail_if
in 20
.
[10]
sleep(20000)
[20]
input: None
fail_if(True)
But I do not know which way is better or if we should allow both. This also applies to done_if
and done_if(all)
(for stopping the entire workflow).
The current behavior is that sos would wait for the completion of step 10
before exiting with step 20
.
I still think the scope of something like done_if
is confined within a step and fail_if
stops the entire workflow -- basically if one sees something "fail", it would be natural to stop everything and fix it up immediately, rather than having to live with it and wait for other tasks to finish before attending to the problematic runs. It is somewhat awkward otherwise.
See the above table. What about
- Change the behavior of
fail_if
to kill immediately. - Keep
stop_if
as it is (remove_output
) - Add
skip_if
or an option tostop_if
to disable the behavior of removing_output
- Add
done_if
as the break behavior. It will continue to submit_index <= done_if_index
, stop submitting new and remove_output
of_index > done_if_index
.
I see. I mostly agree with the current proposal, plus:
- If we are to keep the behavior of
stop_if
then I vote for adding an option to letstop_if
keep the output. - The problem to have
done_if
is that in scenarios not involving substeps, it will behave likestop_if
. So maybe we should use anotherstop_if
option to implement this, something like:
stop_if(mode = 'break')
for the done_if
stop_if(mode = 'continue')
for the stop_if(keep_output = True)
?
OK, the mode
parameter is not terribly clear what will happen, so I would prefer two parameters:
stop_all=True
,keep_output=True
: this assume everything has been done before, we stop every substep but does not change_output
.stop_all=True
,keep_output=False
: we need to be more careful about this one because we have to do things differently for substeps before and after_index
as proposed.stop_all=False
,keep_output=True
: assuming this substep is done, keep other substeps running- (default)
stop_all=False
,keep_output=False
: not do this one at all, remove_output
.
Okay, but stop_all
sounds less intuitive than break
... ? Especially when the function name already has a stop
break_all
or just all
? We have all
or -a
from time to time.
This patch fixes
import time
[10]
time.sleep(2000)
[20]
input: None
time.sleep(2)
fail_if(True)
I think all
is good.
But stopIf(all=True)
does not sound like break. It appears to to apply the action to all substeps.
Well I preferred all
over break_all
, but it in fact is just break
, right? stop_if(break = True)
; or quit_step = True
, if you find break
a jargon?
stop_if(_index > 32)
is the same as
stop_if(_index == 32, break=True)
I complaint about the former because all substeps needs to be executed but at the same time _output
has a chance to be generated which makes
stop_if(_index > 32, keep_output=True)
a lot easier to implement.
So if we do not consider performance (which could be implemented in a more clever way, perhaps), I think
stop_if(_index > 32, keep_output=True)
if better than
stop_if(_index == 32, break=True)
stop_if(_index > 32)
is more natural to think of than using break
. But I thought we need break
for situations more complicated than using _index
. For example, when, say, remaining disk space is low, so break
will quit it once and for all. I think _index
is still a somewhat advanced feature we do not expect people to use a lot.
My concern is that
stop_if(all=True, keep_output=True)
is difficult to incomplete as we will have to evaluate all _output
anyway for keep_output=True
. Then the break
behavior of
stop_if(all=True, keep_output=False)
is troublesome to implement in the concurrent case because substeps before it might not have been submitted and substeps after it might have been running, and the only easy way to incomplete it is to disable concurrency in this case.
So in the end we do not have a good solution in either case and we do not have a convincing use case for it, perhaps we can keep stop_if
at the substep level for now.
Not saying we need it now, but the all
can be named skip_rest
and we can disallow the case skip_rest=True, keep_output=True
since it is conceptually difficult to understand (we cannot get output of skipped substeps). Then, skip_rest=True
will skip all the rest of the substeps.
Well, the interface is now stop_if(no_output=False)
. I argue that we should make no_output=True
or keep_output=False
by default. Also it would be nice if we can finalize the interface? The problem with skip_rest
is that in practice it is difficult to determine what it is meant by "rest" for concurrent process. I still feel "break" sounds better? Also for concurrent process, i'm not sure if skip_rest
or break
is well defined -- you might get random results every time you run it, depending on when the "stop_if" behavior is triggered.
So maybe let's finalize the interface by first deciding if we use stop_if(no_output=False)
and stop_if(keep_output=True)
? We can table the decision on break behavior if the random nature of it bothers?
The argument for stop_if(no_output=False)
is at #1161. Basically it is bad to use an operation that removes a user-generated output as default. skip_rest
is difficult to implement, and that is why I did not implement it.
Basically it is bad to use an operation that removes a user-generated output as default
I agree with this particular point. But my arguement was ususally stop_if
should happen before output is generated anyways: at least for most of my pipelines, if not all, outputs are generated at the end of the step. So typically when stop_if
happens there is nothing to remove anyways. But the problem is that with no_otput=False
these output
will be evaluated at the next step which will then complain about missing output thus breaking the pipeline altogether.
BTW I've got a worse example that i'm trying to make a MWE for -- I think stop_if(no_output=False)
is making my other concurrent substeps fail. Does nto sound like it makes sense but that's what I think is happening and I'm trying to confirm with MWE.
I think stop_if(no_output=False) is making my other concurrent substeps fail.
Cannot reproduce it now. Please ignore.
I think now I can live stop_if(no_output=False)
by default given everything considered. But can we change no_output
to remove_output
? That feels like removing it both from disk and from step_output
.
I considered remove_output
and discard_output
but these implies that the output is already generated and should be removed. In contrast no_output
can implies both.
So we have
no_output=False
, default behavior.- If output generated, good, keep it.
remove_output=False
is appropriate. - if output is not generated, error because there is supposed to be output.
remove_output=False
is not quite ok.
- If output generated, good, keep it.
no_output=True
,- if output generated, remove it as user asks,
remove_output=True
is ok. - if output not generated, do nothing.
remove_output=True
is not quite ok because output was not generated in the first place.
- if output generated, remove it as user asks,
But all these are subtle and tend to be confusing and are why we could not settle down with a good set of names. Perhaps it is good to separate the use cases and say something like
output:
produce_output
done_if() # already done (output generated)
and
output:
skip_if() # skip substep (and no output has been generated)
produce_output
but these implies that the output is already generated and should be removed.
they also imply output will be removed from step_output
which is why I am in favor of those names. no_output
does not quite highlight this. I think it is more important to highlight from workflow point of view.
Perhaps it is good to separate the use cases and say something like
In my head I already considered stop_if(no_output=True)
equal to skip_if
. I'm not quite sure what done_if
is for (i know what it does but not sure if it worth separating). I think stop_if
behavior already covers at least my two user cases with no_output=True
and False
. I only have issues with the argument name that does not highlight consequence to step_output
, and the fact it breaks following pipelines by default in a not so peaceful way (fail with error); hence I felt skip
behavior is more approperate by default but i do not have a strong opinion on it.
[1]
input: for_each = {'i': range(10)}
output: ...
stop_if(i == 4)
[2]
...
[GW] sos run test1.sos
INFO: Running 1:
INFO: output: 0.txt 1.txt... (10 items in 10 groups)
ERROR: [1]: Output target 4.txt does not exist after the completion of step 1
[default]: 1 pending step: 2
My point was that stop
does not automatically imply if the output has been generated, and the users have to think of the use of parameter no_output
or remove_output
, which is technical details. Instead of the parameter, perhaps we can just write done two cases
output:
produce_output ...
done_if()
here done
means what we want to achieve for the substep is already done, namely output already created, and
output:
skip_if()
produce_output...
here skip
means we do not want to run the substep
, which naturally means no _output
.
If we introduce the actions with the complete patterns (order of output, action, and produce_output), perhaps it is much easier for users to understand.
Sure I think we are on the same page. It is just that I'm already used to thinking of done_if
= stop_if
and skip_if
= stop_if(no_output=True)
so I'm indifferent towards introducing these two additional options: to me they are just alias.
My workflow example above tries to make a case that fails weird, which I now use done_if
to show that the problem still is there:
[1]
input: for_each = {'i': range(10)}
output: ...
done_if(i == 4)
generate_output(...)
[2]
...
the point is there is no restriction where done_if
appears so in this case it breaks [2]
in a not so clean way. Or do you consider that user error, or expected behavior?
That would be an user error because it says the step is done before the output is actually generated.
That would be an user error because it says the step is done before the output is actually generated.
Okay now I agree the name done_if
is intuitive in implying it is a user error.
So to summarize:
- we want to keep
stop_if
as is for backwards compatiblilty but we stop documenting and using it - we introduce
done_if
andskip_if
- behavior of
fail_if
andwarn_if
remains unchanged
right?
Yes, that sounds better than stop_if(no_output=True/False)
. We have swing between different options quite a few times now, let us finalize it this way.