grafana/xk6-disruptor

Disruptor does not affect traffic sent via `kubectl port-forward`

Closed this issue · 10 comments

roobre commented

Note: This is the outcome of researching why https://github.com/grafana/quickpizza/pull/2/files was not working as expected. Thanks @ppcano for reporting this to us!

The current configuration of the disruptor causes it not to disrupt traffic sent to the pods via kubectl port-forward, regardless of whether the target of the port-forward is a pod or a service.

The reason for this is that the disruption is limited to traffic flowing through the eth0 interface, as restricted by the iptables command here:

const redirectCommand = "%s PREROUTING -t nat -i %s -p tcp --dport %d -j REDIRECT --to-port %d"

And defaulted here:

cmd.Flags().StringVarP(&iface, "interface", "i", "eth0", "interface to disrupt")
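For illustration, here is a minimal, self-contained sketch of how that template renders, assuming the agent appends (-A) a rule for an upstream on port 80 and a proxy listening on port 8080 (these concrete values are assumptions, not the agent's defaults):

package main

import "fmt"

// Same format string as the redirectCommand constant referenced above.
const redirectCommand = "%s PREROUTING -t nat -i %s -p tcp --dport %d -j REDIRECT --to-port %d"

func main() {
    // Hypothetical values: append (-A) a rule for upstream port 80 on eth0,
    // redirecting matching traffic to a proxy listening on port 8080.
    rule := fmt.Sprintf(redirectCommand, "-A", "eth0", 80, 8080)
    fmt.Println("iptables " + rule)
    // iptables -A PREROUTING -t nat -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 8080
}

Because the rendered rule matches only packets arriving on eth0 (-i eth0), connections forwarded by kubectl port-forward, which arrive on lo, never hit the REDIRECT target.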

This is because, when using kubectl port-forward, traffic is forwarded through the lo interface. This can be verified by running kubectl port-forward and then tcpdump inside the pod: the forwarded traffic only shows up when capturing on lo (-i lo).

The solution for this is non-trivial.

โŒ Workaround attempt with iface: "lo"

A tempting way to work around this would be to specify iface: "lo" in the disruptor configuration. This, unfortunately, will not work, as the disruptor proxy itself uses localhost (thus lo) as its upstream:

upstreamHost := "localhost"

If attempted, the generated REJECT rule:

const resetCommand = "%s INPUT -i %s -p tcp --dport %d -m state --state ESTABLISHED -j REJECT --reject-with tcp-reset"

shuts down what would otherwise be an infinite connection loop:

# Curl to upstream directly is refused by the REJECT rule.
/ # curl 127.0.0.1:3333
curl: (7) Failed to connect to 127.0.0.1 port 3333 after 0 ms: Couldn't connect to server

# Curl to the proxy echoes the error the proxy sees when its connection to upstream is
# severed by the same rule.
/ # curl localhost:8080
Get "http://127.0.0.1:3333/": dial tcp 127.0.0.1:3333: connect: connection reset by peer
/ #

The test, however, will appear to work, showing neither disruptions nor resets. This is because kubectl port-forward forwards the target port on both an IPv4 and an IPv6 address. k6 will see a connection reset while attempting to connect to 127.0.0.1:3333, as per the REJECT rule above. When this happens, k6 automatically falls back to [::1]:3333, which is exempt from the disruption, and the request therefore completes normally, with neither disruption nor reset.
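For reference, a minimal sketch of that fallback behaviour; k6 is written in Go, so its dialer ultimately builds on the net package, and the port (3333) and the printed values below are illustrative assumptions:

package main

import (
    "fmt"
    "net"
)

func main() {
    // "localhost" typically resolves to both loopback families.
    addrs, err := net.LookupHost("localhost")
    if err != nil {
        panic(err)
    }
    fmt.Println(addrs) // e.g. [127.0.0.1 ::1]

    // Dialing by hostname tries each resolved address in turn: if
    // 127.0.0.1:3333 is reset by the REJECT rule, the dialer moves on to
    // [::1]:3333, which the IPv4 iptables rules above do not cover.
    conn, err := net.Dial("tcp", "localhost:3333")
    if err != nil {
        fmt.Println("dial failed:", err)
        return
    }
    defer conn.Close()
    fmt.Println("connected to", conn.RemoteAddr()) // likely [::1]:3333 in this scenario
}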

Moving forward

There are several paths forward, none of them trivial:

  1. Document the problem and leave disruption of port-forwarded traffic out of scope.
  2. Allow using the pod IP, rather than localhost, as the upstream, so that the iface: "lo" workaround works.
    a. This has poor UX.
  3. Send proxy traffic from a specific address, and exclude that address from the iptables redirection.

Document the problem and leave disruption of port-forwarded traffic out of scope.

I think this should be done as a temporary measure, since a proper solution may take time to implement and test, so that users who hit the same issue don't waste time debugging it.

Allow using the pod IP, rather than localhost, as the upstream, so that the iface: "lo" workaround works.
a. This has poor UX.

I've not checked whether this actually solves the issue, but assuming it does, I don't see why this would be bad UX, as the IP can be detected by the agent and used in the command.

roobre commented

I've not checked whether this actually solves the issue, but assuming it does, I don't see why this would be bad UX, as the IP can be detected by the agent and used in the command.

I probably should have elaborated further. I see two main issues with this approach:

The first one is that if we add a way to change the target upstream (e.g. targetPodIP: true, or something similar), this option effectively requires setting iface: "lo". If iface is set to lo but targetPodIP is false (the default), the pod loses connectivity. Likewise, if iface is set to eth0 (the default) and targetPodIP is true, the pod also loses connectivity. We could define a higher-level option, like local: true, as an abstraction to solve this.

The second (and probably more important) problem is that we cannot disrupt the same address we use as the upstream. So if we use the pod IP as the upstream, we will not be able to disrupt regular traffic, just localhost (port-forward) traffic. Having to specify this in the test does not feel like good UX to me, especially because it may fail if the deployment mode is changed later.

roobre commented

Thinking more about this, it might be possible to enable this without UX compromises by making the setup and iptables rules a bit more involved. The rough idea is the following:

First, we add a new local-only IP to the pod, e.g. 127.6.0.1/8, which the proxy will listen on. Then, we modify the proxy code to egress traffic from this address (net.Dialer.LocalAddr).
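A minimal sketch of that proxy-side change, assuming the proxy dials its upstream with Go's net.Dialer and that 127.6.0.1 is already configured on lo (the upstream address here is illustrative):

package main

import (
    "fmt"
    "net"
)

func main() {
    // Egress from the dedicated local-only address, so the iptables rules
    // below can tell proxy-originated traffic apart from everything else
    // via the "! -s 127.6.0.1/32" match.
    dialer := &net.Dialer{
        LocalAddr: &net.TCPAddr{IP: net.ParseIP("127.6.0.1")}, // source IP, ephemeral port
    }

    // Hypothetical upstream; in the agent this would be the real upstream port.
    conn, err := dialer.Dial("tcp", "127.0.0.1:3333")
    if err != nil {
        fmt.Println("dial failed:", err)
        return
    }
    defer conn.Close()
    fmt.Println("dialing from", conn.LocalAddr(), "to", conn.RemoteAddr())
}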

Finally, we modify the iptables rules so they look like the following:

-A PREROUTING -t nat -p tcp --dport {{upstreamPort}} ! -s 127.6.0.1/32 -j DNAT --to-destination 127.6.0.1:{{proxyPort}}
-A INPUT -i %s -p tcp --dport {{upstreamPort}} ! -s 127.6.0.1/32 -m state --state ESTABLISHED -j REJECT --reject-with tcp-reset

The effect of this would be that all traffic directed towards the upstream application is redirected to the proxy, except the traffic coming from the proxy itself (as per ! -s 127.6.0.1/32).
This removes the need to specify an interface (one can still be added if we find it valuable, but it is not needed): traffic is redirected to the proxy regardless of the interface it arrives on, and only the proxy's own traffic reaches the real application.
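If the agent keeps building its rules from format strings, the proposal could be expressed with hypothetical constants mirroring the existing redirectCommand and resetCommand (names, placeholders, and the fixed 127.6.0.1 address are assumptions, not the agent's actual code):

package main

import "fmt"

// Hypothetical format strings; not the agent's actual constants.
const (
    dnatCommand   = "%s PREROUTING -t nat -p tcp --dport %d ! -s 127.6.0.1/32 -j DNAT --to-destination 127.6.0.1:%d"
    rejectCommand = "%s INPUT -p tcp --dport %d ! -s 127.6.0.1/32 -m state --state ESTABLISHED -j REJECT --reject-with tcp-reset"
)

func main() {
    // Assumed example values: upstream port 80, proxy port 8080.
    fmt.Println(fmt.Sprintf(dnatCommand, "-A", 80, 8080))
    fmt.Println(fmt.Sprintf(rejectCommand, "-A", 80))
}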

The second (and probably more important) problem is that we cannot disrupt the same address we use as the upstream. So if we use the pod IP as the upstream, we will not be able to disrupt regular traffic, just localhost (port-forward) traffic. Having to specify this in the test does not feel like good UX to me, especially because it may fail if the deployment mode is changed later.

I think that if we document the issue for pods exposed using port-forward, and we define a simple option that addresses it (similar to the local option you suggested above), the UX is acceptable.

Also, I don't think we need to support disrupting port-forwarded traffic and regular traffic at the same time. This seems like a reasonable limitation to me.

First, we add a new local-only IP to the pod, e.g. 127.6.0.1/8, which the proxy will listen on.

Do we need to create a new IP address? If the proxy always binds to the pod IP and the iptables rules exclude this address as a source IP, I think the result would be the same. Traffic coming into the pod through eth0 would never have this IP as its source address, since that traffic originates outside the pod.
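As a reference for that idea, a minimal sketch of how the agent could detect the pod IP at runtime, assuming eth0 is the pod's primary interface and carries a single IPv4 address (both assumptions):

package main

import (
    "fmt"
    "net"
)

// podIPv4 returns the first IPv4 address configured on the given interface,
// which in a typical pod would be the pod IP on eth0.
func podIPv4(ifaceName string) (net.IP, error) {
    iface, err := net.InterfaceByName(ifaceName)
    if err != nil {
        return nil, err
    }
    addrs, err := iface.Addrs()
    if err != nil {
        return nil, err
    }
    for _, addr := range addrs {
        if ipNet, ok := addr.(*net.IPNet); ok && ipNet.IP.To4() != nil {
            return ipNet.IP, nil
        }
    }
    return nil, fmt.Errorf("no IPv4 address on %s", ifaceName)
}

func main() {
    ip, err := podIPv4("eth0")
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    // The proxy could bind to this address, and the iptables rules could use
    // "! -s <ip>/32" to exclude the proxy's own traffic.
    fmt.Println("pod IP:", ip)
}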

roobre commented

Do we need to create a new IP address? If the proxy always binds to the pod IP and the iptables rules exclude this address as a source IP, I think the result would be the same.

It sounds like it should be the case, but using an IP that already has a different purpose, and is perceived as external, for this seems... strange. It might be confusing even for us, later on, why we decided to do this.

An extra bonus of using a dedicated IP is that we reduce the chances of colliding with applications.

First, we add a new local-only IP to the pod, e.g. 127.6.0.1/8, which the proxy will listen on.

Which interface will have this IP address?

roobre commented

Which interface will have this IP address?

My idea was to add it to lo.
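For completeness, a rough sketch of how the agent might add that address, assuming it shells out to the ip tool inside the target pod's network namespace (this mechanism is an assumption, not how the agent currently works):

package main

import (
    "fmt"
    "os/exec"
)

func main() {
    // Hypothetical: add the dedicated local-only address to the loopback
    // interface, equivalent to running: ip addr add 127.6.0.1/8 dev lo
    out, err := exec.Command("ip", "addr", "add", "127.6.0.1/8", "dev", "lo").CombinedOutput()
    if err != nil {
        fmt.Printf("adding address failed: %v: %s\n", err, out)
        return
    }
    fmt.Println("127.6.0.1/8 added to lo")
}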

roobre commented

This was solved, albeit partially, in #231. Should we close this and leave #254 open?