cloudfoundry/silk-release

Timeout when connecting to same target due to iptables masquerading issue

ywei2017 opened this issue · 6 comments

Issue

Especially during load testing, a container (or potentially multiple containers) on a diego_cell will try to reach out to a common target. We observe connection timeouts, even though the target is on the same network.

  • We ran a tcpdump in the CF container and can see the packet pattern associated with a TCP handshake timeout (repeated SYNs without a reply); see the example capture filter after this list.
  • Then we ran a network sniffer, which doesn't reveal any pattern of TCP handshake failure, so we concluded the error is entirely within the diego_cell itself.
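
For reference, a capture filter along these lines isolates that pattern; the interface name and the target IP (203.0.113.10) are placeholders, not our actual endpoint:

$ tcpdump -ni eth0 'tcp[tcpflags] & tcp-syn != 0 and host 203.0.113.10'

This shows both the outgoing SYNs and any SYN-ACKs coming back, so retransmitted SYNs with nothing in between stand out.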

Context

This is observed in our dev, stage, and prod environments. We recently noticed that it occurs at a higher rate during load testing, but it definitely happens without load testing as well.

Steps to Reproduce

Run a load test against a container that creates outbound TCP connections.

Expected result

Requests should succeed with similar response time.

Current result

We see random TCP connection timeouts and slow responses.

Possible Fix

Update the iptables SNAT/MASQUERADE rules to use random port allocation.
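
For illustration only, this is the kind of rule change the linked article describes: adding fully random port allocation to the masquerade rule. The overlay subnet below is a placeholder, the actual rule silk-release installs may look different, and --random-fully needs a reasonably recent iptables/kernel:

$ iptables -t nat -A POSTROUTING -s 10.255.0.0/16 ! -d 10.255.0.0/16 -j MASQUERADE --random-fully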

Additional Context

I found the article in a K8s context and assume the same applies to CF. If not, we may need to dig elsewhere.


Hi @ywei2017,

First thoughts

Reading through the provided article, it seems there might be a performance problem if multiple containers on a diego_cell bind to the same SNAT port; the article gives an example of two containers binding to port 32000. I don't think this is true for CF.
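
As a sanity check, the masquerade rules a cell actually programs can be listed directly on the diego_cell (generic iptables inspection, nothing silk-specific):

diego-cell:~# iptables -t nat -S POSTROUTING | grep -i masquerade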

Trying it Myself

I wanted to follow the same instructions that the article suggested to see if conntrack showed the same SNAT port when making outbound TCP connections.

To set up the experiment, I first started with a fresh cf-deployment environment and pushed up two apps:

  • a proxy with the hostname rutabaga
  • a dora with the hostname dora

Then I created the internal route and policy so the proxy could reach dora via the internal route domain:

$ cf map-route dora apps.internal --hostname dora
$ cf add-network-policy rutabaga dora

Now we observe that our rutabaga proxy can reach the dora:

$ curl -k https://rutabaga.lasallegreen.cf-app.com/proxy/dora.apps.internal:8080
Hi, I'm Dora!

Observing SNAT ports via conntrack

I found the internal IP of the dora container via:

$ cf ssh dora -c 'ip a'
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    ...
86: eth0@if87: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1410 qdisc noqueue state UP group default
    link/ether ee:ee:0a:ff:8f:2b brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.255.143.43 peer 169.254.0.1/32 scope link eth0
       valid_lft forever preferred_lft forever

Now I can generate outbound connections, list them, and see which SNAT ports they use:

$ for i in `seq 0 5`; do curl -k https://rutabaga.lasallegreen.cf-app.com/proxy/dora.apps.internal:8080/delay/3 && echo; done
YAWN! Slept so well for 3 seconds
YAWN! Slept so well for 3 seconds
YAWN! Slept so well for 3 seconds
YAWN! Slept so well for 3 seconds
YAWN! Slept so well for 3 seconds
YAWN! Slept so well for 3 seconds
diego-cell/1286acdd-aec3-408f-b488-67838e857984:~# conntrack  -L -d 10.255.143.43
tcp      6 83 TIME_WAIT src=10.255.143.44 dst=10.255.143.43 sport=34918 dport=8080 src=10.255.143.43 dst=10.255.143.44 sport=8080 dport=34918 [ASSURED] mark=0 use=1
tcp      6 86 TIME_WAIT src=10.255.143.44 dst=10.255.143.43 sport=34926 dport=8080 src=10.255.143.43 dst=10.255.143.44 sport=8080 dport=34926 [ASSURED] mark=0 use=1
tcp      6 70 TIME_WAIT src=10.255.143.44 dst=10.255.143.43 sport=34890 dport=8080 src=10.255.143.43 dst=10.255.143.44 sport=8080 dport=34890 [ASSURED] mark=0 use=1
tcp      6 67 TIME_WAIT src=10.255.143.44 dst=10.255.143.43 sport=34884 dport=8080 src=10.255.143.43 dst=10.255.143.44 sport=8080 dport=34884 [ASSURED] mark=0 use=1
tcp      6 89 TIME_WAIT src=10.255.143.44 dst=10.255.143.43 sport=34932 dport=8080 src=10.255.143.43 dst=10.255.143.44 sport=8080 dport=34932 [ASSURED] mark=0 use=1
tcp      6 64 TIME_WAIT src=10.255.143.44 dst=10.255.143.43 sport=34872 dport=8080 src=10.255.143.43 dst=10.255.143.44 sport=8080 dport=34872 [ASSURED] mark=0 use=1

As we can see, the outbound sport values range from 34872 to 34932; no two of them use the same SNAT port.
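
For completeness, the race described in the article would also show up as a growing insert_failed counter in the conntrack statistics, which can be checked on the cell with the same conntrack tool used above:

diego-cell:~# conntrack -S | grep -o 'insert_failed=[0-9]*'

If those per-CPU counters stay at zero under load, two flows are not racing for the same NAT tuple.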

Next Steps

As far as I understand, this does not apply to CF.

❓ Could you please provide more information on how you reproduced this issue in your environment(s)? Ideally, an example app and some simple steps would be helpful.

Thanks,
@jrussett

@jrussett -- Thanks for the response. Let me cogitate on it more and see whether I can construct some way to check it out. If I read the K8s article right, the source ports should be different, but two containers could try to use the same one and run into a race condition, due to how iptables works. conntrack is a cool tool I haven't used, so I can play with it.

@jrussett - A quick update.

I have done more digging over the past few days, mainly trying to reproduce the issue, with some success. From all I can see, it appears to be a TCP port reuse issue.

  1. The app in a Diego Cell connects to the common target (an F5 endpoint).
  2. If the server (F5) disconnects (for whatever reason; in our case most often an idle timeout), the socket on the F5 stays in TIME_WAIT for a duration.
  3. If any app on the Diego Cell then tries to connect to the target using the same (randomly assigned) source port, the connection times out.

I think the issue is always possible, but the likelihood increases on a busy Diego Cell where multiple containers tend to hit a popular, common target, which is our case.
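
A rough way to gauge the source-port churn a busy cell generates toward the common target is to count the tracked flows for it, reusing the conntrack listing shown earlier (203.0.113.10 stands in for the F5 VIP):

diego-cell:~# conntrack -L -d 203.0.113.10 | wc -l

The larger and faster-churning that number is, the sooner a recently used source port gets handed out again and collides with a socket the F5 still holds in TIME_WAIT.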

We are taking several actions.

  1. Work with the F5 team to investigate the TIME_WAIT duration.
  2. Work with our app teams to close idle connections (instead of letting the server close them).

Thanks for the investigation.
I think we can close the issue.

@jrussett - In case you wonder, we finally found out the reason. When the server (F5) terminates the connection and its socket is in the FIN_WAIT_2 state, the client is expected to send a FIN packet to close the connection. If the client is busy and the FIN packet is delayed or never sent, the server (F5) won't accept new connections on the same source port while the socket remains in FIN_WAIT_2. F5 actually has an article on this. I re-examined the tcpdump captures and they all match.
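
For anyone hitting something similar: on the client side this shows up as sockets stuck in CLOSE_WAIT (the app received the F5's FIN but never closed its own side). Assuming ss is available in the container rootfs (ip from the same iproute2 package was used earlier in this thread), and with APP_NAME and 203.0.113.10 as placeholders:

$ cf ssh APP_NAME -c "ss -tn state close-wait 'dst 203.0.113.10'"

Long-lived entries here correspond to the delayed or missing client FIN described above.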

Thanks
Yansheng

This has been confirmed as an F5 config issue.