TCP flow deadlock: receive window closes and never opens again
mfp opened this issue · 8 comments
I have observed the following situation which results in a flow deadlock:
- on the receiver, the user buffer fills and the TCP window progressively closes until it reaches small values (but not necessarily 0)
- on the sender, data transmission stops as per normal flow control
- on the receiver, the data from the user buffer is consumed but no further ACKs are sent to increase the receive window
- deadlock
This is a situation akin to the one prevented by the TCP persist timer (which I don't see implemented, BTW), but it happens in the absence of packet loss, and even when the window is not exactly 0.
AFAICS, in the situation where `User_buffer.Rx` becomes full, the receive window is about 0, the sender stops, and there are no unACKed segments left in the `Segment.Rx` queue (or there are some, but they are quickly depleted as data is consumed and a little space is freed, with no net receive window increase), there is no mechanism for the receiver to inform the sender that its receive window has grown again as the application consumes data.
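For comparison, the sender-side remedy TCP prescribes for the exact-zero-window case is a persist timer: keep probing a closed window so a lost window update cannot wedge the connection forever. A minimal sketch of such a loop, where `peer_window`, `send_probe` and `sleep_ns` are hypothetical stand-ins for stack internals, not real mirage-tcpip functions:

```ocaml
open Lwt.Infix

(* Sketch of a persist-timer loop.  [peer_window], [send_probe] and
   [sleep_ns] are assumed stand-ins for stack internals. *)
let rec persist_loop ~peer_window ~send_probe ~sleep_ns () =
  if Int32.equal (peer_window ()) 0l then begin
    (* Probe with one byte beyond SND.UNA; the peer must answer with an
       ACK carrying its current window, so a lost window update cannot
       stall the connection indefinitely. *)
    send_probe () >>= fun () ->
    sleep_ns 667_000_000L >>= fun () ->  (* 667ms, the base RTO seen in the log below *)
    persist_loop ~peer_window ~send_probe ~sleep_ns ()
  end else
    Lwt.return_unit
```

(As noted above, this alone would not cover the small-but-nonzero window variant reported here.)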
I've verified that the deadlock does not happen (obviously) if I comment out the `Window.set_rx_wnd` call in `User_buffer`, or if I change `set_rx_wnd` to make sure the window is never smaller than a few MSS (with some care to cope with rounding and window scaling).
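For concreteness, a minimal sketch of that second workaround, using the tighter bound quoted later in this thread; the record fields mirror what `window.ml` appears to carry, but treat the exact names as assumptions:

```ocaml
(* Sketch of a bounded set_rx_wnd.  Field names follow what window.ml
   appears to use, but they are assumptions here. *)
type t = {
  mutable rx_wnd : int32;   (* receive window we advertise *)
  tx_mss : int;             (* negotiated MSS              *)
  rx_wnd_scale : int;       (* our window-scale shift      *)
}

let set_rx_wnd t sz =
  (* Lower-bound the advertised window at 3 * MSS, rounded up to the
     window-scale granularity so the scaled value never truncates to
     something uselessly small. *)
  let lower_bound =
    Int32.of_int
      (((3 * t.tx_mss + 1 lsl t.rx_wnd_scale) lsr t.rx_wnd_scale)
       lsl t.rx_wnd_scale)
  in
  t.rx_wnd <- max sz lower_bound
```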
Note that flow control does work even if the receive window is not decreased as the user buffer fills, since the unACKed data would accumulate in `Segment.Rx` and eventually (if the application does not read fast enough) reach the window size. The first effect would be that the receiver uses up to twice the initial (and unmodified) receive window's worth of memory: once for the application data, and once more for the unACKed data held in the `Segment.Rx` queue.
If the second workaround is applied, the extra amount of memory is bounded by a few MSS, so that'd seem a priori a workable quick fix.
If I'm reading `segment.ml` correctly, we can also easily prevent deadlocks in the (ACK) packet-loss scenario: the extra unACKed data ensures at least one ACK will be sent as the user buffer drains, and if that ACK is lost, the retransmission timer triggers a resend that is answered with an ACK (`force_ack = true` in mirage-tcpip/src/tcp/segment.ml, line 151 at 4bb3820), thus partially undoing commit 08e8692 for that specific case.
Edit: I've realized this would also prevent the silly window syndrome, so I wonder: is there a reason I can't see (it's late...) why there shouldn't be a non-zero lower bound on the receive window? In other words, is there any problem with always having a little extra unACKed data around, even when the receive buffer is full? I can see it would cause a few extra retransmissions every (at least) 667ms if the application consumes data exceedingly slowly, but the computed RTT would keep increasing, and with it the retransmission timer period, limiting the overhead.
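For reference, the classic receiver-side silly-window-syndrome avoidance rule (RFC 1122, section 4.2.3.3) attacks the same problem by deferring window updates rather than bounding the window from below: newly freed space is only advertised once it amounts to a full MSS (or half the buffer). A rough sketch, with an illustrative record that is not mirage-tcpip's actual state:

```ocaml
(* Illustrative receive-side state; not mirage-tcpip's actual record. *)
type rx_state = {
  mutable advertised_wnd : int32;  (* window last advertised to the peer *)
  buffer_size : int32;             (* total receive buffer size          *)
  mss : int32;                     (* maximum segment size               *)
}

(* Only move the advertised window edge once the newly freed space is
   at least a full MSS or half the buffer, per RFC 1122's receiver-side
   SWS avoidance rule. *)
let advertise t available =
  let increase = Int32.sub available t.advertised_wnd in
  let threshold = min t.mss (Int32.div t.buffer_size 2l) in
  if increase >= threshold then t.advertised_wnd <- available;
  t.advertised_wnd
```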
Thanks very much for this thoughtful and well-written report, @mfp. I'll have a closer look at this later today.
Do you have code for reproducing this, or did you observe it in some running system?
The issue is very easily reproduced once you know what you're looking for; all it takes is a consumer that's slower than the sender.
Here's a self-contained testcase: https://gist.github.com/mfp/1250890c2e7fa790365369116855aa0f
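The essence of the repro is just a reader that sleeps between chunks. A rough sketch against the generic Mirage FLOW and TIME module signatures (the real, runnable testcase is in the gist above; the 100ms pause is an arbitrary value, merely slower than the sender):

```ocaml
open Lwt.Infix

(* Slow-consumer sketch against the generic FLOW/TIME signatures. *)
module Slow_consumer (F : Mirage_flow.S) (T : Mirage_time.S) = struct
  let rec run flow =
    F.read flow >>= function
    | Ok (`Data _chunk) ->
      (* The data is consumed here, but once rx_wnd has closed no
         window-update ACK follows, so the sender never resumes. *)
      T.sleep_ns 100_000_000L >>= fun () ->  (* 100ms per chunk *)
      run flow
    | Ok `Eof | Error _ -> Lwt.return_unit
end
```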
I patched mirage-tcpip trivially as per https://gist.github.com/mfp/c19f41c98d6ad28fab2223f567dba1e3 to dump more TCP window info on ACK reception, and the output looks as follows -- the lines for the server and client are intermingled, but you can see the server's TX window decreasing quickly until there's no further transmission.
```
tcpip_bug: [DEBUG] sequence validation: seq=452858349 range=452858349[262140] res=true
tcpip_bug: [DEBUG] Window: rx_nxt=452858349 rx_nxt_inseq=452858349 tx_nxt=466847192 rx_wnd=262140 tx_wnd=10180 cwnd=273509 snd_una=466847152 backoffs=0 rto=667000000
tcpip_bug: [DEBUG] Established - Recv_ack(466847192) -> Established
tcpip_bug: [DEBUG] no size limit, sending
tcpip_bug: [DEBUG] sequence validation: seq=466847192 range=466847192[10140] res=true
tcpip_bug: [DEBUG] Window: rx_nxt=466847192 rx_nxt_inseq=466847192 tx_nxt=452858349 rx_wnd=10140 tx_wnd=262140 cwnd=11880 snd_una=452858349 backoffs=0 rto=667000000
tcpip_bug: [DEBUG] Established - Recv_ack(452858349) -> Established
tcpip_bug: [DEBUG] no size limit, sending
tcpip_bug: [DEBUG] sequence validation: seq=452858349 range=452858349[262140] res=true
tcpip_bug: [DEBUG] Window: rx_nxt=452858349 rx_nxt_inseq=452858349 tx_nxt=466851152 rx_wnd=262140 tx_wnd=10140 cwnd=273566 snd_una=466847192 backoffs=0 rto=667000000
tcpip_bug: [DEBUG] Established - Recv_ack(466851152) -> Established
tcpip_bug: [DEBUG] no size limit, sending
tcpip_bug: [DEBUG] sequence validation: seq=466851152 range=466851152[6180] res=true
tcpip_bug: [DEBUG] Window: rx_nxt=466851152 rx_nxt_inseq=466851152 tx_nxt=452858349 rx_wnd=6180 tx_wnd=262140 cwnd=11880 snd_una=452858349 backoffs=0 rto=667000000
tcpip_bug: [DEBUG] Established - Recv_ack(452858349) -> Established
tcpip_bug: [DEBUG] no size limit, sending
tcpip_bug: [DEBUG] sequence validation: seq=452858349 range=452858349[262140] res=true
tcpip_bug: [DEBUG] Window: rx_nxt=452858349 rx_nxt_inseq=452858349 tx_nxt=466851192 rx_wnd=262140 tx_wnd=6180 cwnd=273623 snd_una=466851152 backoffs=0 rto=667000000
tcpip_bug: [DEBUG] Established - Recv_ack(466851192) -> Established
tcpip_bug: [DEBUG] no size limit, sending
tcpip_bug: [DEBUG] sequence validation: seq=466851192 range=466851192[6140] res=true
tcpip_bug: [DEBUG] Window: rx_nxt=466851192 rx_nxt_inseq=466851192 tx_nxt=452858349 rx_wnd=6140 tx_wnd=262140 cwnd=11880 snd_una=452858349 backoffs=0 rto=667000000
tcpip_bug: [DEBUG] Established - Recv_ack(452858349) -> Established
tcpip_bug: [DEBUG] no size limit, sending
tcpip_bug: [DEBUG] sequence validation: seq=452858349 range=452858349[262140] res=true
tcpip_bug: [DEBUG] Window: rx_nxt=452858349 rx_nxt_inseq=452858349 tx_nxt=466855152 rx_wnd=262140 tx_wnd=6140 cwnd=273680 snd_una=466851192 backoffs=0 rto=667000000
tcpip_bug: [DEBUG] Established - Recv_ack(466855152) -> Established
tcpip_bug: [DEBUG] timerloop: stoptimer
tcpip_bug: [DEBUG] timerloop: stoptimer
=========== DEADLOCK!!! =============
=========== DEADLOCK!!! =============
```
Both commenting out the line that resizes the window in `user_buffer.ml` and adding a lower bound to the window in `set_rx_wnd` get rid of the deadlock.
Thanks again for this report and for the repro code. I've adapted your gist (crediting you with the initial commit) and mashed it into the test suite at https://github.com/yomimono/mirage-tcpip/tree/deadlock_test . Please let me know if you don't want this code included in the repository here.
I tend to agree that the best quick solution for the moment is to set a lower bound in `set_rx_wnd`, unless someone else wants to take on a more complicated solution (logic to not send ACKs when the receive window is small, a persist timer). Perhaps @hannesm or @balrajsingh has thoughts on this?
@mfp, would you be amenable to taking the `deadlock_test` tree above, cherry-picking your careful `set_rx_wnd`-limiting fix, and submitting a PR?
I settled on a simpler expression for the lower bound (I originally had `max sz (Int32.of_int (((3 * t.tx_mss + 1 lsl t.rx_wnd_scale) lsr t.rx_wnd_scale) lsl t.rx_wnd_scale))` in an attempt to make the bound tighter, but it's not like it really matters anyway); see #343.
Tangentially, I also have a trivial patch to set the TCP window scale and initial congestion window (`initcwnd`) via environment variables. Should I submit a PR for it?
> I settled on a simpler expression for the lower bound (I originally had `max sz (Int32.of_int (((3 * t.tx_mss + 1 lsl t.rx_wnd_scale) lsr t.rx_wnd_scale) lsl t.rx_wnd_scale))` in an attempt to make the bound tighter, but it's not like it really matters anyway); see #343.
Thank you!
> Tangentially, I also have a trivial patch to set the TCP window scale and initial congestion window (`initcwnd`) via environment variables. Should I submit a PR for it?
Code that uses the `Unix` module anywhere other than the socket stack unfortunately won't be merged, because `tcpip` needs to be buildable and usable without `Unix`, `Sys`, and other conventional OS friends.
> Code that uses the `Unix` module anywhere other than the socket stack unfortunately won't be merged, because `tcpip` needs to be buildable and usable without `Unix`, `Sys`, and other conventional OS friends.
I see. What about trivial getter/setters then (leaving it up to the application to set the values)?
A more elegant way would be to have the moral equivalent of a typed sysctl provided somewhere in Mirage, but that requires a lot more work (deciding where it goes, coordinating more libs, etc.)
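Concretely, the getter/setter idea could look like the sketch below; the module and value names are purely illustrative, not an existing mirage-tcpip API:

```ocaml
(* Hypothetical getter/setter module; the names are illustrative.
   The application would set these before creating the stack. *)
module Tcp_settings = struct
  let default_rx_wnd_scale = ref 2   (* window-scale shift        *)
  let default_init_cwnd = ref 10     (* initial cwnd, in segments *)

  let set_rx_wnd_scale n = default_rx_wnd_scale := n
  let rx_wnd_scale () = !default_rx_wnd_scale

  let set_init_cwnd n = default_init_cwnd := n
  let init_cwnd () = !default_init_cwnd
end

(* e.g. in the unikernel's start function, before connecting: *)
let () = Tcp_settings.set_init_cwnd 4
```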
> What about trivial getter/setters then (leaving it up to the application to set the values)?
Yeah, that's much friendlier, and what I was about to come back to suggest.
FWIW, we do have a facility for getting values set by the user at compile time or runtime into Mirage unikernels already, but it requires some coordination between the `mirage` tool and the library in question -- see the "Configuration keys" section of https://mirage.io/wiki/hello-world for a little more info.
Thanks for the bug report; I'm sorry that I'm not familiar enough with this code base to evaluate it.
> A more elegant way would be to have the moral equivalent of a typed sysctl provided somewhere in Mirage, but that requires a lot more work (deciding where it goes, coordinating more libs, etc.)
Yes, this work is likely needed at some point. I'd be much in favour of someone tackling this problem rather than a very isolated solution just for the current use case. My current ideas would be either plan9 (we have an implementation in OCaml) or SNMP, but I'm really open to suggestions in that area.
SNMP is already widely deployed, with lots of tooling available (for monitoring, setting values, etc.), and can be used both to read and write values (as well as to trigger traps). I have no experience with plan9.