keikoproj/active-monitor

Update healthcheck spec and controller to support automatic "remediation"

davemasselink opened this issue · 3 comments

Currently, the Healthcheck custom resource and its child Argo workflow can "detect" whether there is a problem.

However, there is no place to express what action to take in case of a problem.

Therefore, the healthcheck spec should support an alternative Argo workflow for "remediation". If the main workflow fails, the "remediation" workflow should be run. Additional remediation metrics should also be captured and exposed.
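
To make the proposal concrete, here is a minimal sketch of the "run remediation when the main workflow fails" flow, with a few illustrative counters. Everything in it is an assumption for discussion: `HealthCheckSpec`, `RemediationWorkflow`, `WorkflowRunner`, `RemediationMetrics`, and `RunWithRemediation` are hypothetical names, not the existing active-monitor types or the Argo API, and the workflow names are made up.

```go
package main

import (
	"errors"
	"fmt"
)

// WorkflowRunner is a hypothetical stand-in for "submit an Argo workflow and
// wait for its result". It is not part of active-monitor or Argo.
type WorkflowRunner func(workflowName string) error

// HealthCheckSpec is a simplified, hypothetical spec showing the proposed field.
type HealthCheckSpec struct {
	Workflow            string // the existing "core" workflow
	RemediationWorkflow string // proposed; empty means no remediation
}

// RemediationMetrics holds illustrative counters for the proposed metrics.
type RemediationMetrics struct {
	CoreFailures        int
	RemediationRuns     int
	RemediationFailures int
}

// ErrNoRemediation is returned when the core workflow fails and the spec
// does not name a remediation workflow.
var ErrNoRemediation = errors.New("core workflow failed; no remediation configured")

// RunWithRemediation runs the core workflow and, only if it fails and a
// remediation workflow is configured, runs the remediation workflow.
func RunWithRemediation(spec HealthCheckSpec, run WorkflowRunner, m *RemediationMetrics) error {
	if err := run(spec.Workflow); err == nil {
		return nil // healthy: nothing to remediate
	}
	m.CoreFailures++
	if spec.RemediationWorkflow == "" {
		return ErrNoRemediation
	}
	m.RemediationRuns++
	if err := run(spec.RemediationWorkflow); err != nil {
		m.RemediationFailures++
		return fmt.Errorf("remediation workflow failed: %w", err)
	}
	return nil
}

func main() {
	spec := HealthCheckSpec{Workflow: "check-dns", RemediationWorkflow: "restart-dns"}
	// Fake runner for the demo: the core check fails, the remediation succeeds.
	run := func(name string) error {
		if name == "check-dns" {
			return errors.New("probe failed")
		}
		return nil
	}
	var m RemediationMetrics
	err := RunWithRemediation(spec, run, &m)
	fmt.Println(err, m.CoreFailures, m.RemediationRuns, m.RemediationFailures)
}
```

In a real controller the counters would presumably be exposed through whatever metrics mechanism the project already uses; plain integers are used here only to keep the sketch self-contained.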


Open Source software thrives with your contribution. It not only gives you skills you might not be able to get in your day job, it also looks amazing on your resume.

If you want to get involved, check out the contributing guide, then reach out to us on Slack so we can see how to get you started.

@codetamaracode, would you be interested in contributing to this issue?

Can I pick up this issue? Please let me know.

A few more thoughts and notes on this issue, based on follow-up conversations:

  • What is the best high-level approach?
    • Designate a "remediation" Argo workflow which is executed upon each "core" workflow failure.
    • Collect metrics from the "core" workflow and execute the "remediation" workflow only once certain metric thresholds are met.
  • Would there be situations where a "remediation" workflow requires a different set of permissions than the "core" workflow? If so, would it be alright to support this, or might there be privilege-escalation security concerns?
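
As one possible reading of the threshold-based approach above, the sketch below triggers remediation only after a configurable number of consecutive "core" workflow failures. The thread does not say which metric or threshold is intended, so "consecutive failures" is an assumption, and `FailureTracker`, `Threshold`, and `Record` are hypothetical names.

```go
package main

import "fmt"

// FailureTracker counts consecutive "core" workflow failures and signals
// when a (hypothetical) remediation threshold has been reached.
type FailureTracker struct {
	Threshold   int // consecutive failures required before remediating
	consecutive int
}

// Record registers one core workflow result and reports whether the
// remediation workflow should be run now.
func (t *FailureTracker) Record(coreSucceeded bool) bool {
	if coreSucceeded {
		t.consecutive = 0 // a success resets the streak
		return false
	}
	t.consecutive++
	if t.consecutive >= t.Threshold {
		t.consecutive = 0 // start counting again after triggering
		return true
	}
	return false
}

func main() {
	tracker := &FailureTracker{Threshold: 3}
	// false = core workflow failed, true = core workflow succeeded.
	results := []bool{false, false, true, false, false, false}
	for i, ok := range results {
		if tracker.Record(ok) {
			fmt.Printf("run %d: threshold reached, trigger remediation\n", i+1)
		}
	}
}
```

With `Threshold: 1` this reduces to the first approach, where every failure triggers remediation, so the two options could share a single code path.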