Old TCP flows are in `TIME_WAIT`, blocking port reuse
Closed this issue · 2 comments
@citrus-it observed that outbound TCP flows will occasionally hang while running a package sync operation from an OmniOS VM in Colo to both a DigitalOcean-hosted and an Oxide-hosted mirror of the package repository. This workload is notable for spinning up a large number of short-lived TCP flows. Specifically, the guest's connection remains in SYN-SENT, none of the (repeated) SYN packets are seen by the other host, and the OPTE flow state is caught in TIME-WAIT for an unbounded time.
An initial reproduction hit this using `curl -I <site>` in a bash loop, but we found that it can be consistently triggered using `curl -I --local-port 56000 <site>`:
1. Invoke curl. The HTTP HEAD request completes successfully.
   a. `opteadm dump-tcp-flows` shows the flow is in state `TIME-WAIT`.
2. Invoke curl again.
   a. `snoop` shows SYNs being sent with exponential backoff. The reply hangs.
   b. `opteadm dump-tcp-flows` shows the flow remains in state `TIME-WAIT`. The out byte/pkt counts seem to increment.
   c. `kstat` for the port in question shows `out_drop_tcp_err` incrementing in time with the SYNs.
3. Ctrl-C.
   a. `snoop` shows an RST sent.
   b. `opteadm dump-tcp-flows` shows the flow is cleared.
4. GOTO 1.
Currently we're not doing any periodic pruning of TCP flow table entries, which makes it quite likely that one will encounter this on a long-running VM (or a short-running one, if using `pkg`!).
- We could do periodic pruning, but this presents the issue that we don't know exactly what a guest's TIME_WAIT timer configuration is.
- The alternative is that, at least in the outbound case, a new SYN should always overwrite flow state. I'd need to think more about whether it's sane to do this for inbound flows.
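A minimal sketch of the outbound half of that alternative, with hypothetical `FlowTable`/`FlowEntry`/`TcpState` types standing in for OPTE's real flow-table structures (these are not OPTE's actual APIs): on an outbound SYN, an entry parked in TIME_WAIT is restarted in place rather than left to wedge the reused port.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for OPTE's flow-table types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TcpState {
    SynSent,
    Established,
    TimeWait,
}

#[derive(Debug)]
struct FlowEntry {
    state: TcpState,
}

// Simplified flow key: (guest local port, remote port).
type FlowKey = (u16, u16);

struct FlowTable {
    flows: HashMap<FlowKey, FlowEntry>,
}

impl FlowTable {
    /// Outbound path: a fresh SYN from the guest always wins over a
    /// stale TIME_WAIT entry, so port reuse can't hang in SYN-SENT.
    fn handle_outbound_syn(&mut self, key: FlowKey) {
        match self.flows.get_mut(&key) {
            Some(entry) if entry.state == TcpState::TimeWait => {
                // Restart the flow state machine in place.
                entry.state = TcpState::SynSent;
            }
            Some(_) => {
                // SYN on a live flow: leave to normal TCP handling.
            }
            None => {
                self.flows.insert(key, FlowEntry { state: TcpState::SynSent });
            }
        }
    }
}
```

The inbound case is deliberately left out here, per the open question above about whether overwriting is sane for flows initiated from outside.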
Thanks for writing this up.
I think we should also do periodic pruning of at least TIME_WAIT entries; otherwise a source port that's only used once by a guest persists in the TCP flow table, and a guest with multiple IP addresses could presumably use it all up. The challenge here is what expiry time to pick, as the guest's timer will vary: Linux and illumos seem to default to 60s, but that can of course be tuned, and I've certainly seen it tuned in the field down to 20s; Windows and macOS default to 120s.
As I understand it, this state is about how long to wait around for any remaining ACKs on the connection, so setting the expiry to 120s, in conjunction with flushing the entry immediately if a new SYN packet appears from the guest, seems like a reasonable starter for 10.
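A sketch of that combined policy, under the assumption of a fixed 120s bound (all names here are hypothetical, not OPTE's real data structures): TIME_WAIT entries carry the instant they entered the state, a periodic reaper drops any that have sat longer than the bound, and an outbound SYN flushes the matching entry immediately.

```rust
use std::time::{Duration, Instant};

// Conservative upper bound covering Windows/macOS defaults (120s),
// per the discussion above; an assumption, not a measured value.
const TIME_WAIT_EXPIRY: Duration = Duration::from_secs(120);

// Hypothetical TIME_WAIT bookkeeping, keyed by guest source port.
struct TimeWaitEntry {
    port: u16,
    entered_at: Instant,
}

/// Periodic reaper: drop TIME_WAIT entries older than the bound.
fn prune(entries: &mut Vec<TimeWaitEntry>, now: Instant) {
    entries.retain(|e| now.duration_since(e.entered_at) < TIME_WAIT_EXPIRY);
}

/// Immediate flush when the guest reuses the port with a new SYN,
/// so a tuned-down guest timer never has to wait out the full bound.
fn flush_on_syn(entries: &mut Vec<TimeWaitEntry>, port: u16) {
    entries.retain(|e| e.port != port);
}
```

The two mechanisms are complementary: the SYN flush handles active port reuse, while the reaper bounds table growth for ports a guest never touches again.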
I took a closer look at the problem reported in https://github.com/oxidecomputer/customer-support/issues/25 to see if it is the same or a similar problem.
I used `curl` on atrium to ping the API with a source port of 40000. Before doing that, I checked that there is no entry in the TCP flow table for this port:
```
BRM44220011 # opteadm dump-tcp-flows -p opte0 | grep 40000
BRM44220011 #
```

```
atrium% curl --local-port 40000 https://oxide.sys.rack2.eng.oxide.computer/v1/ping
{"status":"ok"}%
```
At this point, there is a TIME_WAIT entry in the flow table:
```
BRM44220011 # opteadm dump-tcp-flows -p opte0 | grep 40000
TCP:172.30.2.5:443:172.20.3.69:40000 TIME_WAIT 24 12 13 2508 5761
```
I waited 120 seconds, so that most operating systems would have expired the port's TIME_WAIT state, and tried again:
```
atrium% sleep 120
atrium% curl --local-port 40000 https://oxide.sys.rack2.eng.oxide.computer/v1/ping
```

and this hangs.
Note that the `SEGS IN` and `BYTES IN` fields in the TCP flow output are incrementing, which is the same symptom as the original outbound flow problem.
```
FLOW                                  STATE     HITS  SEGS IN  SEGS OUT  BYTES IN  BYTES OUT
TCP:172.30.2.5:443:172.20.3.69:40000  TIME_WAIT 30    18       13        3396      5761
```
Once `curl` on the client gives up and resets the connection, the flow is removed:
```
BRM44220011 # opteadm dump-tcp-flows -p opte0 | grep 40000
BRM44220011 #
```