oxidecomputer/opte

Old TCP flows are in `TIME_WAIT`, blocking port reuse


@citrus-it observed that outbound TCP flows will occasionally hang while running a package sync operation from an OmniOS VM in Colo to both a DigitalOcean-hosted and an Oxide-hosted mirror of the package repository. This workload is notable for spinning up a large number of short-lived TCP flows. Specifically, the host remains in SYN-SENT, none of the (repeated) SYN packets are seen by the other host, and the OPTE flow state is stuck in TIME_WAIT for an unbounded time.

An initial reproduction hit this using curl -I <site> in a bash loop, but we found that this can be consistently triggered using curl -I --local-port 56000 <site>:

  1. Invoke curl. HTTP HEAD request completes successfully.
    a. opteadm dump-tcp-flows shows flow is in state TIME-WAIT.
  2. Invoke curl again.
    a. snoop shows SYNs being sent with exponential backoff. curl hangs waiting for a reply.
    b. opteadm dump-tcp-flows shows the flow remains in state TIME_WAIT. The out byte/pkt counts appear to increment.
    c. kstat for the port in question shows out_drop_tcp_err incrementing in step with the SYNs.
  3. Ctrl-C.
    a. snoop shows RST sent.
    b. opteadm dump-tcp-flows shows flow is cleared.
  4. GOTO 1.

Currently we're not doing any periodic pruning of TCP flow table entries, which makes it quite likely that one will encounter this on a long-running VM (or even a short-running one, if it uses pkg!).

  • We could do periodic pruning, but this presents the issue that we don't know exactly how a guest's TIME_WAIT timer is configured.
  • The alternative is that, at least in the outbound case, a new SYN should always overwrite the existing flow state (a rough sketch of this follows the list). I'd need to think more about whether it's sane to do this on inbound flows.
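Roughly, the outbound idea could look like the Rust sketch below. Every name in it (FlowTable, FlowEntry, handle_outbound_syn, and so on) is made up for illustration rather than taken from OPTE; the point is only the TIME_WAIT check on an outbound SYN.

use std::collections::HashMap;
use std::net::SocketAddrV4;

// All names below are invented for this sketch; they are not OPTE's types.
#[derive(Clone, Copy, PartialEq)]
enum TcpState {
    SynSent,
    TimeWait,
    // ...the other TCP states are elided here.
}

// A flow is keyed on its (local, remote) endpoints.
type FlowKey = (SocketAddrV4, SocketAddrV4);

struct FlowEntry {
    state: TcpState,
    segs_in: u64,
    segs_out: u64,
}

struct FlowTable {
    entries: HashMap<FlowKey, FlowEntry>,
}

impl FlowTable {
    // On an outbound SYN from the guest: if the entry for this flow key is
    // parked in TIME_WAIT, the old connection is finished and the guest is
    // reusing the port, so replace the entry instead of treating the SYN
    // as an error against the stale state.
    fn handle_outbound_syn(&mut self, key: FlowKey) {
        let in_time_wait = matches!(
            self.entries.get(&key),
            Some(entry) if entry.state == TcpState::TimeWait
        );
        if !in_time_wait && self.entries.contains_key(&key) {
            // SYN on a flow that is still live: leave existing handling alone.
            return;
        }
        // No entry, or a stale TIME_WAIT one: track the new connection.
        self.entries.insert(
            key,
            FlowEntry { state: TcpState::SynSent, segs_in: 0, segs_out: 1 },
        );
    }
}

Whether the same replacement is safe for inbound SYNs is the open question noted in the second bullet above.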

Thanks for writing this up.

I think we should also do periodic pruning of at least TIME_WAIT entries, as otherwise a source port that's only used once by a guest persists in the TCP flow table, and a guest with multiple IP addresses could presumably fill the table up. The challenge is what expiry time to pick, as the guest's timer will vary: Linux and illumos seem to default to 60s, but that can of course be tuned (I've seen it tuned in the field down to 20s), while Windows and macOS default to 120s.

As I understand it, this is about how long to wait around for any remaining ACKs on the connection, so setting the expiry to 120s, in conjunction with flushing the entry immediately if a new SYN packet appears from the guest, seems like a reasonable starter for 10.
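To make that concrete, here is a standalone Rust sketch of such a periodic pass. The names and the 120-second constant are assumptions standing in for whatever expiry we settle on; this is not OPTE's actual code or configuration.

use std::collections::HashMap;
use std::net::SocketAddrV4;
use std::time::{Duration, Instant};

// Conservative default: longer than the 60s Linux/illumos TIME_WAIT timer
// and matching the 120s default on Windows and macOS. This constant is an
// assumption for the sketch, not an existing OPTE tunable.
const TIME_WAIT_EXPIRY: Duration = Duration::from_secs(120);

struct Entry {
    // When the flow transitioned into TIME_WAIT; None while it is still live.
    time_wait_since: Option<Instant>,
}

// Intended to run from a periodic maintenance pass: drop any entry that has
// sat in TIME_WAIT for longer than the expiry window, leaving live flows and
// younger TIME_WAIT entries alone. A fresh SYN from the guest (previous
// sketch) would still clear an entry immediately, without waiting for this.
fn prune_time_wait(
    table: &mut HashMap<(SocketAddrV4, SocketAddrV4), Entry>,
    now: Instant,
) {
    table.retain(|_key, entry| match entry.time_wait_since {
        Some(since) => now.duration_since(since) < TIME_WAIT_EXPIRY,
        None => true,
    });
}

Pairing this with the SYN flush means the periodic expiry only has to act as a backstop, so erring on the long side (120s) should cost little.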

I took a closer look at the problem reported in https://github.com/oxidecomputer/customer-support/issues/25 to see if it is the same or a similar problem.

I used curl on atrium to ping the API with a source port of 40000. Before doing that, I checked that there was no entry in the TCP flow table for this port:

BRM44220011 # opteadm dump-tcp-flows -p opte0 | grep 40000
BRM44220011 #
atrium% curl --local-port 40000 https://oxide.sys.rack2.eng.oxide.computer/v1/ping
{"status":"ok"}%

At this point, there is a TIME_WAIT entry in the flow table:

BRM44220011 # opteadm dump-tcp-flows -p opte0 | grep 40000
TCP:172.30.2.5:443:172.20.3.69:40000             TIME_WAIT    24       12       13       2508       5761

Waiting 120 seconds so that most operating systems would expire the port's TIME_WAIT state, and trying again:

atrium% sleep 120
atrium% curl --local-port 40000 https://oxide.sys.rack2.eng.oxide.computer/v1/ping

and this hangs.

Note that the SEGS IN and BYTES IN fields in the TCP flow output are incrementing while the state stays in TIME_WAIT, which mirrors the original outbound flow problem (there it was the outbound counters that incremented).

FLOW                                             STATE        HITS     SEGS IN  SEGS OUT BYTES IN   BYTES OUT
TCP:172.30.2.5:443:172.20.3.69:40000             TIME_WAIT    30       18       13       3396       5761

Once curl gives up on the client side and resets the connection, the flow is removed.

BRM44220011 # opteadm dump-tcp-flows -p opte0 | grep 40000
BRM44220011 #