Handle case where clear redfish job queue fails

Question

smalleni opened this issue 4 years ago · 0 comments

In some cases, when the lifecycle controller is unresponsive, the "Clear redfish job queues" task fails for a host after exhausting its retries, like this:

Tuesday 08 September 2020  11:13:23 -0400 (0:00:01.885)       0:00:56.108 *****
changed: [e23-h24-b01-fc640.rdu2.scalelab.redhat.com] => (item=e23-h24-b02-fc640.rdu2.scalelab.redhat.com)
changed: [e23-h24-b01-fc640.rdu2.scalelab.redhat.com] => (item=e23-h24-b03-fc640.rdu2.scalelab.redhat.com)
changed: [e23-h24-b01-fc640.rdu2.scalelab.redhat.com] => (item=e23-h24-b04-fc640.rdu2.scalelab.redhat.com)
FAILED - RETRYING: Clear redfish job queues (3 retries left).
FAILED - RETRYING: Clear redfish job queues (2 retries left).
FAILED - RETRYING: Clear redfish job queues (1 retries left).
failed: [e23-h24-b01-fc640.rdu2.scalelab.redhat.com] (item=e23-h26-b01-fc640.rdu2.scalelab.redhat.com) => {"ansible_loop_var": "item", "attempts": 3, "changed": true, "cmd": "source /tmp/ansible.i18syj8m/.venv/bin/activate\n./src/badfish/badfish.py -u quads -p rdu2@244 -i config/idrac_interfaces.yml -H mgmt-e23-h26-b01-fc640.rdu2.scalelab.redhat.com --clear-jobs --force\n", "delta": "0:00:00.396239", "end": "2020-09-08 15:15:05.763297", "item": "e23-h26-b01-fc640.rdu2.scalelab.redhat.com", "msg": "non-zero return code", "rc": 1, "start": "2020-09-08 15:15:05.367058", "stderr": "- ERROR    - Failed to communicate with mgmt-e23-h26-b01-fc640.rdu2.scalelab.redhat.com\n- ERROR    - There was something wrong executing Badfish.", "stderr_lines": ["- ERROR    - Failed to communicate with mgmt-e23-h26-b01-fc640.rdu2.scalelab.redhat.com", "- ERROR    - There was something wrong executing Badfish."], "stdout": "", "stdout_lines": []}
changed: [e23-h24-b01-fc640.rdu2.scalelab.redhat.com] => (item=e23-h26-b02-fc640.rdu2.scalelab.redhat.com)

In these cases, a racreset might help. What we want to do is collect the list of hosts where clearing the redfish job queue failed and run badfish to do a racreset on those hosts, before doing the --check-boot and subsequent boot-order manipulation. Another option is to use --ls-jobs first and only clear jobs if any exist in the first place.
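A rough sketch of the first option, written in Python rather than as an Ansible task so the control flow is easy to see. The names `clear_jobs_with_reset`, `run_badfish`, `mgmt_host` and the `BADFISH` base command are hypothetical, the credentials are placeholders, and `--racreset` is assumed to be the badfish flag that triggers the reset; only `--clear-jobs --force` and the `mgmt-` host prefix are taken from the log above.

```python
import subprocess
from typing import Callable, Iterable, List, Sequence

# Hypothetical base command: path and credentials are placeholders.
BADFISH = ["./src/badfish/badfish.py", "-u", "USER", "-p", "PASSWORD"]

Runner = Callable[[Sequence[str]], int]


def run_badfish(cmd: Sequence[str]) -> int:
    """Run one badfish command and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def mgmt_host(host: str) -> str:
    """Management interface name, following the mgmt- prefix seen in the log."""
    return f"mgmt-{host}"


def clear_jobs_with_reset(
    hosts: Iterable[str], runner: Runner = run_badfish, retries: int = 3
) -> List[str]:
    """Clear job queues, then racreset every host where clearing kept failing.

    Returns the hosts that needed a racreset so the caller can wait for
    them (or skip them) before --check-boot and boot-order changes.
    """
    needs_reset: List[str] = []
    for host in hosts:
        cmd = BADFISH + ["-H", mgmt_host(host), "--clear-jobs", "--force"]
        # any() stops at the first successful attempt (exit code 0).
        if not any(runner(cmd) == 0 for _ in range(retries)):
            needs_reset.append(host)

    for host in needs_reset:
        # Assumption: --racreset is the flag that resets the iDRAC.
        runner(BADFISH + ["-H", mgmt_host(host), "--racreset"])

    return needs_reset
```

The sketch retries immediately with no delay and does not wait for the controller to come back after the reset, so a real playbook would probably need a pause or a readiness check before continuing. The --ls-jobs variant is not covered here.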