StackStorm/st2

Orquesta - Support output-on-error semantics

nmaludy opened this issue ยท 20 comments

SUMMARY

In Mistral workflows there was an option for an output-on-error stanza. This block allowed workflow authors to return data in the event of an error within the workflow. We use this feature internally do things like:

  • Returning meaningful error messages from workflows
  • Returning the data that caused the error from a workflow
  • Returning a list of tasks that completed and those that failed within a variable (used for emails, ticketing systems and dashboards)

I would be open to a different mechanism (does not need to be output-on-error). But some way to replicate this functionality would be great!

ISSUE TYPE
  • Feature Request
STACKSTORM VERSION
$ st2 --version
st2 2.8.1, on Python 2.7.5
OS / ENVIRONMENT / INSTALL METHOD
OS = CentOS 7.5
Install method = puppet-st2
Example Mistral workflow
encore.preprovision:
  type: direct
  input:
    - domain_type
    - dns_domain
    - hostname
    - task_list: []
    - vm_os

  output:
    ip_address: "{{ _.ip_address | default(None) }}"
    task_list: "{{ _.task_list }}"
    vm_dns_records: "{{ _.dns_records | default(None) }}"
    vm_os_name: "{{ _.vm_os_name | default(None) }}"
  output-on-error:
    task_list: "{{ _.task_list }}"
    vm_dns_records: "{{ _.dns_records | default(None) }}"

  tasks:

...

    validate_ipam_hostname_unique:
      action: encore.mm_check_hostname
      input:
        hostname: "{{ _.hostname }}"
        dns_domain: "{{ _.dns_domain }}"
      publish:
        task_list: "{{ _.task_list + [{'name': 'validate_ipam_hostname_unique', 'status': 'SUCCESS'}] }}"
      publish-on-error:
        task_list: "{{ _.task_list + [{'name': 'validate_ipam_hostname_unique', 'status': 'ERROR', 'err_reason': 'Host already exists with name: ' + _.hostname + '.' + _.dns_domain }] }}"
      on-success:
        - validate_ad_computer_unique: "{{ _.vm_os.type == 'windows' and _.domain_type.lower() == 'domain' }}"
        - build_dns_records: "{{ _.vm_os.type != 'windows' or _.domain_type.lower() != 'domain' }}"
      on-error:
        - send_error_email
...

@nmaludy We are aware of this requirement. Post v2.9, we are planning to implement a variation of this. The following is our current thinking. We would extend the fail command and allow user to provide additional parameters. Other suggestion is welcomed. Although just be aware that I'm not a big fan of Mistral's *-on-error or any of the on-condition semantics.

version: 1.0

description: A workflow demonstrating fail command.

tasks:
  task1:
    action: core.local
    input:
      cmd: '>&2 echo "boom!"; exit 1'
    next:
      - when: <% failed() %>
        publish:
          - err_msg: 'task1 failed'
          - extra_data: <% result().stderr %>
        do: fail(<% ctx(err_msg) %>, extra=<% ctx(extra_data) %>)

Here's another variation. We can add a log entry to be included in the errors of orquesta workflow result. Orquesta workflow currently has list of errors and output in the result.

version: 1.0

description: A workflow demonstrating fail command.

tasks:
  task1:
    action: core.local
    input:
      cmd: '>&2 echo "boom!"; exit 1'
    next:
      - when: <% failed() %>
        publish:
          - err_msg: 'task1 failed'
          - extra_data: <% result().stderr %>
        log:
          - message: <% ctx(err_msg) %>
            extra: <% ctx(extra_data) %>
        do: fail

And the action execution would look something like the following where the errors is automatically captured by the engine and the logs are user generated.

id: 5b996bfe0a08a4187eca165c
action.ref: foobar
status: failed
start_timestamp: Wed, 12 Sep 2018 19:41:50 UTC
end_timestamp: Wed, 12 Sep 2018 19:41:51 UTC
result: 
  logs:
    - message: task1 failed
      extra: boom!
      task_id: task1
  errors:
    - message: Execution failed. See result for details.
      result:
        failed: true
        return_code: 1
        stderr: boom!
        stdout: ''
        succeeded: false
      task_id: task1
  output: null

@dzimine @bigmstone I would like your input on the design above. I kinda prefer the latter option.

@m4dcoder: In @nmaludy's example, the output-on-error is at the workflow level, not the task level. It's a catch-all, what to do if something in the workflow itself fails, not if an individual task fails.
https://specs.openstack.org/openstack/mistral-specs/specs/ocata/approved/publish-on-failure.html

(@nmaludy I haven't actually used this before - is that right?)

@cognifloyd IMO, if the workflow fails, then there's no output (or return value) from the execution. Think of this as a function call. If the function fails, it'll return errors or print errors to console, but it doesn't have a return value. Therefore, if the workflow author wants to output something for troubleshooting, the option for a log command that I propose above let's user write messages and any extras. Since the placement of the log command is coupled with the condition of the task execution, it allows for user to determine when to write the log entries. Please remember that an error could also be a like HTTP where the execution is successful but with an error code. In this case, output-on-error or publish-on-error semantics won't be able to catch them. Then finally, on completion of the workflow execution, regardless of success or failure, we will return the log entries in the workflow result in a log section like we do for errors.

@cognifloyd also, orquesta handles unexpected exceptions at the workflow level and write error message under errors in result. I also have seen what @nmaludy do in his workflows. So he can use log to log progress with any extra data and then if workflow fails unexpectedly, the log progress + data + errors will be in the workflow result.

I'd prefer to have the option to do output-on-error with Orquesta too, or perhaps just to always do it. In one example I have a workflow that runs a remote command. If the remote command fails I still want to be able to get the stdout and stderr and exit_code of the remote command, just like when the workflow succeeds. What you are suggesting is that now I need conditional logic to get stdout from my remote command:

    if my_workflow.status == 'succeeded':
        stdout = my_workflow.result.output.stdout
    else:
        stdout = my_workflow.log.extra.stdout

That's really sub-optimal for the use case where you always want output, regardless of whether the workflow succeeds or fails. I assume that's why mistral supports it.

You don't have to fail the workflow in your example. Just let the workflow succeeds and write the error code and stdout to the workflow output. Think of your use case as HTTP calls.

Not all actions or workflows are going to conform to a HTTP interaction pattern. If one-shot ssh commands always returned "success" even if the command failed on the remote end, and someone had to look at different data to determine the result, all ssh pipelines would be much more complex than they are today. For the workflow type I am implementing, that is the case.

The canonical way to understand if a workflow succeeded or failed is through when: <% failed() %> or `when: <% succeeded() %>. Your suggestion introduces a new paradigm which I don't think is a good idea to do. A workflow that always succeeds, but what it did may have failed? We should continue to use the overall status in the way it was intended.

I still believe it would be better to honor the output on failure and keep it in the same place. Omitting this is a regression from the mistral engine. It's not my call, I can only voice my thoughts on it. If this is the way you want to handle it then once the feature is added I'll just add next handlers in my workflows to deal with the fact that the output might be in two different places depending on the status of the workflow. What's been proposed here makes workflows that handle I/O more difficult and is a regression from mistral's provided functionality.

@jeking3 We are not trying to duplicate Mistral exactly and we are very selective what we introduced into Orquesta. We will revisit features and requests from time to time.

To me, not giving users the option to publish output on workflow failure is counter intuitive and goes against common sense. Similar to running an executable and not getting any output at all if it fails for any reason, not even an exit code.

Failed workflows should let users publish any arbitrary variables with all context available.
Even when implementing HTTP calls any framework lets you respond with non-200 status code and any headers and body added to it.

I can elaborate more on all this, but making Orquesta limited in this regards makes it way harder to migrate from Mistral.

Output from workflows is imperative for chatops aliases, when you need to pass various results from workflows to users. Re-writing mistral workflows so that instead of failing we go to a core.noop task where we publish variables and only after that they go to output is not feasible, let alone it masks their logical end of life in the UI and adds more tasks to workflows where not needed.

@shusugmt So I did some investigation with the Mistral output. Let's get the fact straight. By default, if no output or output-on-error is defined in a Mistral workflow, Mistral will automatically output the entire workflow context. This is a very BAD design IMO because most of the time the workflow context contains secrets. In the case when the workflow fails, you see the variables which you think you output is in fact the workflow context because output-on-error is not defined in the workflow per the info you provided.

@emptywee Please review the entire issue. We hear that users want to have some sort of output in the result to be consumed. We currently don't support output-on-error schema and I am not backing down from this. However, we proposed the log schema in the task transition. Whatever that is logged will be available in the result which will be structured as the following: {'log': [...], 'errors': [...], and 'output': {...}}. This should be very useful because now users can log additional message or data that will be included in the result regardless of workflow state. Each log entries will have specific task id and task transition id to help with identifying source of the message.

So given the contention and misunderstanding on the subject, what I will be doing is to let orquesta process the output section when the workflow completes (regardless of success or failure). The values will be evaluated based on the final workflow context at time of workflow completion. If there is an evaluation failure on a key value pair, then the key value pair for the output will be skipped and an error entry will be appended in the result describing the evaluation error. I think this is sufficient in addition with the log schema in task transition. Uers should be able to take advantage with the change in output and log for troubleshooting and other advanced uses.

@m4dcoder I don't mind having the log thing for whatever debugging purpose it may be, but I was told that orquesta workflows, if they fail, don't process output at all. I haven't tested it myself, but we spoke about it on Slack and I was asked to provide my feedback on the issue.

As to publish on error for tasks... I haven't used it for my mistral workflows since it wasn't really supported by the mistral engine forked into st2, so it won't be a blocker for possible migration to orquesta for me, but it's a nice feature to have anyway, I can see how useful it may be. Logs are logs, output is output, to me they are two different entities with different purposes. Shouldn't be mixed.

@emptywee please review my comments just above that we plan to change when we process output. that should satisfy the contention here.

@m4dcoder I did, thanks.

Workflow output addressed at StackStorm/orquesta#101 and #4436

I have tested the fix with the package from #4436 and LGTM. I could migrate our current mistral workflow which were dependent on this behavior, to orquesta and all runs as expected. Thanks for the rapid fix!!