grafana/xk6-disruptor

Fail fault injection if no traffic has been intercepted

Closed this issue · 5 comments

There are different cases in which a misconfigured fault injection (e.g incorrect target port), or the usage of xk6-disruptor in an unsupported scenario (#214) causes the fault injection not to have any effect. The test would run without errors but the results would be incorrect.

There is an initiative to improve this situation by introducing validations before the fault injection (#217) but there will always be possibilities of misconfigured or unsupported usage.

Therefore, it would be convenient that the agent reported an error if the fault injection did not affect any traffic and the disruptor extension should report this as a warning.

Note: Presently the disruptor extension reports all non-successful return codes from the agent as an error. Therefore, some mechanism is needed for signaling non-fatal errors in the agent.

roobre commented

A simple solution coming to my mind could be for the agent to output to stdout a machine-parsable message (e.g. json) that the disruptor can parse. This message could include logs and stats, e.g. number of disrupted requests.

When we implement daemon mode (do we have an issue for this?), this could be replaced with a stats command that does the same, but reading from the agent's API instead of being echoed when the command finishes.

I was thinking about something even simpler, return a status code from the agent command that indicates the agent did nothing. This is similar to what some Linux commands do. IIRC, awk does it and I think grep also does something similar.
What I'm not sure of at this moment is how to get this RC when executing a command using k8s client library.

Parsing output is generally messy and I don't find returning structured json from a CLI a good practice.

As it seems not possible to get the exit code from exec to a pod, we can use an intermediate solution and actually output something to stderr but use some form of structure that differentiates errors from warnings, similar to a log format:

WARN: no traffic processed ...

roobre commented

Taking a look at this now that #274 has been merged.

Implementation on the agent side is more or less trivial as we simply need to print things to stderr. The communication between agent and k6 seems to be the delicate topic in this case.

The current implementation always discards stdout, and discards stderr if the Exec API call does not return an error:

_, stderr, err := c.helper.Exec(pod.Name, "xk6-agent", cmd, []byte{})
if err != nil {
errors <- fmt.Errorf("error invoking agent: %w \n%s", err, string(stderr))
}

We would need to change this, where I see two options:

  1. Give agentController the ability to log messages, so stderr can be logged even if a nil err is returned.
  2. Propagate both stderr and error up, and handle logging higher in the call stack.

I think option 2 would require more changes, so I'm leaning towards option 1 for now.

Regarding what to do exactly with the contents of stderr, I again see two big options:

  1. Make the agent log to stderr in JSON format, including at least message and log level, and parse those messages somewhere (likely in the agentController).
  2. Log in a custom, more lenient format. For example, something akin to Printf("%s: %s", level, message), and then attempt to parse known prefixes in the agentController. If the prefix is not recognized, assume a default level such as INFO.

I'm fairly neutral between these two. I don't like JSON logging as its very unreadable for humans, but I see it becoming more and more standard as JSON is pretty much universally parsable. Moreover, the lenient parsing mode could more easily hide bugs, such as misspelled log levels (e.g. WARN vs WARNING) leading to annoying bugs that with JSON are easier to catch.

Looking forward to some thoughts on this @pablochacin 🙂

First of all, I'm unsure if we should consider no traffic disruption as a warning that requires a human to check the output. I think an error that makes the test fail is a better option.

If we consider this case an error and not a simple warning, then no further changes are required in the disruptor, because the errors are propagated and reported. In this case, I think printing a message to the agent's stderr would also be the simplest solution because it won't require any parsing on the controller's side.

For simplicity, I prefer this option as it offers most of the value of the requirement with a minimum of changes.

However, as we eventually would like to introduce some logging for debugging, I'm not opposed to the second option.

If we want to inform the user with a warning, we should then use the k6 logger. Introducing it would require changes in the constructors of the Disruptors, the AgentController, and the API (interface between the disruptors and k6/goja).

We would also have to parse the response from the agent in the controller to know if it is an error or not. In this case, I don't have a strong opinion about returning JSON from the agent or writing to stderr, but I prefer the latter, for simplicity. We can use a logging library to avoid the misspelling errors you mention.

If we decide to follow this option, I would prefer not to pass the logger directly to the disruptor constructor, but use some abstraction such as "Runtime" or "Environment" that will allow us to pass other arguments from k6 to the disruptor in the future (e.g. configuration value), but this is a detail we can consider later.