A probe set to stop on failure does not stop the experiment
What happened:
An experiment with a probe set to "stop on failure" does not stop when the probe detects a failure. Instead, the experiment keeps running for the full specified duration. I reproduced this with a pod-delete fault and an HTTP probe.
There are two lines logged in the pod running the fault. First this:
time="2024-12-13T15:32:10Z" level=error msg="The myapp http probe has been Failed, err: {\"errorCode\":\"HTTP_PROBE_FAILURE\",\"phase\":\"ChaosInject\",\"reason\":\"Actual value: 503. Expected value: should be equal to 200\",\"target\":\"myapp\"}"
This is expected when the probe fails.
The next line is where the problem occurs:
time="2024-12-13T15:32:10Z" level=error msg="Unable to patch chaosengine to stop, err: {\"errorCode\":\"HTTP_PROBE_FAILURE\",\"phase\":\"ChaosInject\",\"reason\":\"Actual value: 503. Expected value: should be equal to 200\",\"target\":\"myapp\"}"
What you expected to happen:
The experiment should have been interrupted. The Chaos Engine should have been set to stop.
Where can this issue be corrected? (optional)
I believe this can be fixed in the probe logic in litmus-go. It looks like the probe failure that should trigger stopping the experiment also causes the step that sets the Chaos Engine to stop to be skipped, because the failure is treated as an "error".
How to reproduce it (as minimally and precisely as possible):
- Create a pod-delete experiment that will fail.
- Create an HTTP probe that can detect the failure.
- Set the probe to stop on failure.
- Run the experiment.
Tested on Litmus 3.13.0
Anything else we need to know?: