Better send buffer management: await sock.writeable() and TCP_NOTSENT_LOWAT
Update: based on the discussion below, I now think that the resolution is that Curio should ideally:
- Definitely: provide an API like await sock.writeable() (or whatever spelling is preferred)
- Probably: enable TCP_NOTSENT_LOWAT on sockets whenever possible/convenient, with a buffer size of ~8-16 KiB. This only provides full benefits for code that uses sock.writeable(), but it provides some benefits regardless.
Original report follows, though note that a bunch of my early statements are incomplete/wrong:
So I just stumbled across a rather arcane corner of the Linux/OS X socket API while trying to understand why I'm seeing weird buffering behavior in some complicated curio/kernel interaction.
As we know, calling socket.send doesn't immediately dump data onto the network; instead the kernel just sticks it into the socket's send buffer, to be trickled out to the network as and when possible. Or, if the send buffer is full, then the kernel will reject your data and tell you to try again later. (Assuming non-blocking mode of course.)
But for various reasons, it turns out that the kernel's send buffer is usually way larger than you actually want it to be, which means you can end up queueing up a huge amount of data that will take forever to trickle out, introducing latency and causing various problems.
So at least Linux and OS X have introduced the euphoniously named and terribly documented TCP_NOTSENT_LOWAT feature. Basically what it does is let you use setsockopt to tell the kernel -- hey, I know that you're willing to buffer, like, 5 MiB of data on this socket. But I don't actually want to do that. Can you only wake me up when the amount of data that's actually buffered drops below, like, 128 KiB, and I'll just top it up to there? (This is a bit of a simplification because there are some subtleties about how you budget for data that's queued to send vs. data that's been sent-but-not-yet-acked, but it's good enough to go on.)
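For reference, here is a minimal sketch of what the per-socket setsockopt call could look like from Python. The helper name set_notsent_lowat is made up for illustration, the 128 KiB default just mirrors the number used above, and the sketch assumes a Python build that exports socket.TCP_NOTSENT_LOWAT; where the constant is missing or the kernel rejects the option, it simply reports failure.

```python
import socket


def set_notsent_lowat(sock, nbytes=128 * 1024):
    """Try to set TCP_NOTSENT_LOWAT on a TCP socket.

    Returns True if the option was applied, False if this platform or
    Python build does not support it. (Hypothetical helper name.)
    """
    # The constant is only exported on platforms/builds that know about it.
    opt = getattr(socket, "TCP_NOTSENT_LOWAT", None)
    if opt is None:
        return False
    try:
        sock.setsockopt(socket.IPPROTO_TCP, opt, nbytes)
    except OSError:
        # Kernel does not accept the option for this socket.
        return False
    return True
```

The function is deliberately forgiving so that calling code can treat the option as a best-effort optimization.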
But it turns out that TCP_NOTSENT_LOWAT only affects polling-for-writability. So you absolutely can have a socket where select and friends say "not writeable", but at the same time send is happy to let you write lots and lots of data. And this is bad, because it turns out literally the only way the kernel is willing to give you the information you need to avoid over-buffering is with that "not writeable" signal.
So if you want to avoid over-buffering, then you have to always call select-or-whatever before you call send, and only proceed if the socket is claimed to be writeable.
And unfortunately, right now, curio never does this: it always tries calling send first, and then only if that fails does it block waiting for writeability.
The solution is simple: before calling socket.send, check that the kernel thinks the socket is writeable.
The obvious way to do this would be to replace the current implementation of Socket.send with something like:
async def send(self, data, flags=0):
    while True:
        await _write_wait(self._fileno)
        try:
            return self._socket_send(data, flags)
        except WantWrite:
            pass
        except WantRead:
            await _read_wait(self._fileno)
(and similarly for the other methods. I guess sendall could usefully be rewritten in terms of send, and I'm not sure what if anything would need to be done for sendmsg and sendto. TCP_NOTSENT_LOWAT doesn't apply to UDP, so for sendto it maybe doesn't matter, but I guess it might be better to be safe? And I don't remember what sendmsg is for at all.)
The one potential downside of this strategy that I can see is that right now, send never blocks unless the write actually fails, and if we add an await _write_wait then it will generally suspend the coroutine for one "tick" before doing the actual write, even when the write could have been done without blocking. I guess this might actually be a good thing in that it could promote fairness (think of a coroutine that's constantly writing to a socket with a fast reader, so the writes always succeed and it ends up starving everyone else...), but it might have some performance implications too.
The alternative, which preserves the current semantics, would be to do a quick synchronous check up front, like:
async def send(self, data, flags=0):
    if not select.select([], [self._fileno], [], 0)[1]:
        await _write_wait(self._fileno)
    while True:
        try:
            return self._socket_send(data, flags)
        except WantWrite:
            await _write_wait(self._fileno)
        except WantRead:
            await _read_wait(self._fileno)
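The "quick synchronous check" can also be tried outside of curio. The following is only a sketch with plain standard-library sockets; the helper names is_writable and send_if_writable are hypothetical, and the socket is assumed to be in non-blocking mode.

```python
import select
import socket


def is_writable(sock):
    """Zero-timeout poll: does the kernel currently report sock as writable?"""
    _, writable, _ = select.select([], [sock], [], 0)
    return bool(writable)


def send_if_writable(sock, data):
    """Send only when select() agrees the socket is writable.

    Returns the number of bytes accepted by the kernel, or None if the
    socket was not writable (or the send would have blocked anyway).
    """
    if not is_writable(sock):
        return None
    try:
        return sock.send(data)
    except BlockingIOError:
        return None
```

A caller that gets None back would then wait for writability through whatever event loop it uses before retrying.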
How I managed to confirm this for myself (Linux specific, and mostly recording this for reference, not really any need to read it):
- Enable TCP_NOTSENT_LOWAT globally (because I'm too lazy to figure out the right setsockopt incantation):
echo 128000 | sudo tee /proc/sys/net/ipv4/tcp_notsent_lowat
- Start a socat TCP-LISTEN:4003 STDOUT in one terminal.
- Immediately hit control-Z to suspend the socat, because we want to see what happens if our send buffer fills up.
- Make a connection and get some useful utilities:
In [1]: import socket
In [2]: import select
In [3]: s = socket.create_connection(("localhost", 4003))
In [4]: import struct
In [5]: import fcntl
In [6]: def get_send_buf_size(sock):
   ...:     return struct.unpack("I", fcntl.ioctl(sock.fileno(), 0x5411, b"\0\0\0\0"))[0]  # 0x5411 = SIOCOUTQ
   ...:
In [7]: s.setblocking(False)
Start filling up our send buffer. At first the data goes into our send buffer and then immediately drains into the other side's receive buffer, but since the other side is asleep then eventually this stops and our send buffer starts filling up:
In [8]: get_send_buf_size(s)
Out[8]: 0
In [9]: s.send(b"x" * 1000000)
Out[9]: 174760
In [10]: get_send_buf_size(s)
Out[10]: 0
In [11]: s.send(b"x" * 1000000)
Out[11]: 523864
In [12]: get_send_buf_size(s)
Out[12]: 0
In [13]: s.send(b"x" * 1000000)
Out[13]: 261932
In [14]: get_send_buf_size(s)
Out[14]: 0
In [15]: s.send(b"x" * 1000000)
Out[15]: 130966
In [16]: get_send_buf_size(s)
Out[16]: 121366
Okay, there are 121366 bytes enqueued in our send buffer. That's a little bit below the TCP_NOTSENT_LOWAT that we set, so our socket should still be writeable:
In [17]: select.select([], [s], [])
Out[17]:
([],
[<socket.socket fd=11, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('127.0.0.1', 52146), raddr=('127.0.0.1', 4003)>],
[])
Put some more data in, pushing it over the 128000 limit:
In [18]: s.send(b"x" * 10000)
Out[18]: 10000
In [19]: get_send_buf_size(s)
Out[19]: 131366
Now if we check using select, it's not writeable:
In [20]: select.select([], [s], [])
# hangs until:
KeyboardInterrupt:
BUT if we call send, then no problem, we can definitely write more data to this "non writeable" socket:
In [21]: s.send(b"x" * 100000)
Out[21]: 55483
In [22]: get_send_buf_size(s)
Out[22]: 186849
Okay, after reading up on it a bit more, it sounds like sendmsg generalizes both sendto and send. So maybe the simplest thing is actually to make sendmsg smart, and then make sendto and send and sendall thin wrappers around that.
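To illustrate the "sendmsg generalizes send and sendto" point, here is a rough sketch using the standard library's socket.sendmsg (available on Unix). The wrapper names are invented for this example and this is not Curio's implementation, just the shape of the idea.

```python
import socket


def send_via_sendmsg(sock, data, flags=0):
    """Equivalent of sock.send(data, flags), expressed through sendmsg."""
    # One buffer, no ancillary data, no explicit destination address.
    return sock.sendmsg([data], [], flags)


def sendto_via_sendmsg(sock, data, address, flags=0):
    """Equivalent of sock.sendto(data, flags, address), through sendmsg."""
    # Same call, but with the destination address supplied.
    return sock.sendmsg([data], [], flags, address)
```

With this shape, any "wait for writability first" policy would only need to live in one place.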
Wow, this is pretty interesting. I have two immediate questions: I wonder how asyncio handles writes? I wonder how this would work in synchronous thread-based coding?
Brief thought: I wonder if this could be fixed via inheritance or some other specialization of the socket class? For example, one implementation that tries the send first. Another implementation that awaits first.
Definitely going to think about this...
By my reading of the asyncio source, it seems that it also performs a send() immediately followed by a check for blocking. For example, in asyncio/selector_events.py:
if not self._buffer:
    # Optimization: try to send now.
    try:
        n = self._sock.send(data)
    except (BlockingIOError, InterruptedError):
        pass
    except Exception as exc:
        self._fatal_error(exc, 'Fatal write error on socket transport')
        return
Good point, I just filed a bug on asyncio too to let them know :-)
I wonder if this could be fixed via inheritance or some other specialization of the socket class? For example, one implementation that tries the send first. Another implementation that awaits first.
My first reaction is that this sounds really overcomplicated -- is it really so bad to make it Just Work?
Adding an await _write_wait() introduces about an 85% performance penalty on the Curio echo-server benchmark.
In the big picture, in what scenarios is this TCP_NOTSENT_LOWAT option being used? I can imagine a lot of situations where I would want curio to behave as it does now. For example, services where it's all based on a request/response cycle like HTTP, RPC, etc.
TCP_NOTSENT_LOWAT seems like a pretty special case to me. It should definitely be possible for Curio to support it in some way if someone wanted, but I'm not sure I'd want to add the penalty of the extra write wait to everything to do it.
One option would be to add a new socket method for explicit write waiting if you needed it. For example:
await sock.write_wait()
await sock.send(data)
Another option might be to have Curio intercept setsockopt() and look for TCP_NOTSENT_LOWAT. Based on that, it could enable the extra wait implicitly. Or this could be turned into some kind of more general method/configuration option for sockets that make Curio wait first.
Having just skimmed, I do have the same question -- when is this option actually useful?
Adding an await _write_wait() introduces about an 85% performance penalty on the Curio echo-server benchmark.
Oof. Well, good to know; thanks for checking. Did you also try the version where we do a quick synchronous check (select.select([], [sock], [], 0)) instead of trapping all the way out to the event loop and rescheduling?
I have mixed feelings about using the echo-server benchmark as a guide in cases like this, because its overhead is so incredibly low that I think it can easily push in the wrong direction. If the echo server starts out at 40 us to handle a "request/response cycle", then that's such an incredibly small baseline that it really constrains what you can do. Doing anything at all extra will make that number much worse (going from 40 us -> 50 us is terrible!), while trying to optimize it to post more impressive numbers requires really heroic feats because there's so little headroom to work with -- pretty soon the only way to improve is to start cutting features. But, any real protocol has much higher protocol overhead than that, just to do anything -- e.g. h11 on CPython needs ~200 us to handle a trivial request/response cycle. So suddenly that "terrible slowdown" is from 240 us -> 250 us, which doesn't look so bad. And while going from 40 us -> 20 us is "amazing shocking make a blog post about how we're the fastest", going from 240 us -> 220 us is pretty meh. And that's still just pure protocol overhead, not counting the work needed to actually compute the response. If your average web app needs 5 ms to hit the database, render templates, blah blah, then now we're talking about 5240 us versus 5250 us versus 5220 us, and... yeah. So echo server benchmarks are obviously useful as a way to get some insight into a particular IO loop, optimize, etc., but I think using them as a competitive benchmark and worrying about the raw numbers can easily push us into a harmful direction.
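To make the arithmetic concrete, this tiny sketch computes how large the same fixed 10 us cost looks against each of the baselines mentioned above. The numbers (40, 240, 5240, and the extra 10 us) are the illustrative ones from this comment, not measurements, and relative_slowdown is just a throwaway helper name.

```python
def relative_slowdown(baseline_us, extra_us):
    """Extra cost expressed as a percentage of the baseline."""
    return 100.0 * extra_us / baseline_us


EXTRA_US = 10  # the hypothetical added cost per request/response cycle

for label, baseline_us in [
    ("bare echo server", 40),
    ("echo server + h11", 240),
    ("h11 + 5 ms of app work", 5240),
]:
    pct = relative_slowdown(baseline_us, EXTRA_US)
    print(f"{label}: {baseline_us} us -> {baseline_us + EXTRA_US} us (+{pct:.2f}%)")
```

The same absolute cost shrinks from a quarter of the baseline to a fraction of a percent as the per-request work grows.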
Still... I get that these benchmarks are important and that 85% is a lot!
(I've thought about this in particular in the context of h11 versus Yury's super-fast httptools -- httptools posts incredibly impressive numbers in microbenchmarks with well-behaved peers, while h11 is slower because it actually implements the whole HTTP protocol. For h11 to try to compete with httptools on pure numbers would just force it to start stripping out bits of the protocol, which doesn't help anyone.)
In the big picture, in what scenarios is this TCP_NOTSENT_LOWAT option being used? I can imagine a lot of situations where I would want curio to behave as it does now. For example, services where it's all based on a request/response cycle like HTTP, RPC, etc.
The tl;dr is that TCP_NOTSENT_LOWAT is basically always at least as good as the alternative and should probably be the default. I'm not sure why it isn't (though I note that the people who implemented it did provide a global knob so that you can make it the system-wide default, suggesting that they agree that this isn't ridiculous -- my guess is that the main reason is that there's a parameter you need to set and they don't have any auto-tuning for it yet.) And TCP_NOTSENT_LOWAT's killer use-case is HTTP/2, where you really want it.
Full explanation follows; if you want to skip it then scroll down to the next bit of quoted text :-)
So, background, maybe review maybe not, I don't know how much you keep up with network engineering drama :-). A major problem in using TCP currently is the presence of too much buffering (aka "bufferbloat"). The point of a buffer is to smooth out bursts, so the ideal buffer should fill up when a burst arrives, then trickle down so that it reaches empty just before the next burst arrives -- that way there's always data to send, but no unnecessary delays. The problem with TCP is that if you're sending data fast enough that the network is your bottleneck, then it will happily fill all the buffers in its path completely. At that point, they're not buffers, they're just delay lines. (Think of a grocery store checkout where people arrive at a rate of exactly 1/minute, and where the clerk can process exactly 1/minute. If there's no queue, then each person walks up, gets handled, and leaves 1 minute later, and that's a steady state. If there's a queue of 100 people, then everything still moves at the same speed, and it's still a steady state, but each individual person has to wait 100 minutes. Which is super frustrating! You have the capacity to get everyone through without waiting, it's just the "buffer" that's killing you. I feel like I often encounter telephone service centers that work on this model.)
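The checkout analogy is easy to check with a toy simulation. This is only a sketch of the deterministic case described above (one arrival per minute, one minute of service each); the function name and parameters are invented for illustration.

```python
def time_in_store(initial_queue, n_customers, service_min=1, arrival_gap_min=1):
    """Minutes each newly arriving customer spends from arrival to departure.

    initial_queue is the number of people already in line at time 0.
    """
    # The clerk must first clear the people who were already waiting.
    clerk_free_at = initial_queue * service_min
    times = []
    for k in range(n_customers):
        arrival = k * arrival_gap_min
        start = max(arrival, clerk_free_at)  # wait if the clerk is busy
        finish = start + service_min
        clerk_free_at = finish
        times.append(finish - arrival)
    return times


# Throughput is identical in both cases; only the delay differs.
print(time_in_store(initial_queue=0, n_customers=3))
print(time_in_store(initial_queue=100, n_customers=3))
```

With an empty line everyone is done after their own minute of service; with 100 people already queued, every new arrival waits out the full backlog first, and the backlog never shrinks because arrivals and service run at the same rate.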
This is a multi-faceted problem because these kinds of buffers show up in all sorts of places (the per-socket send buffer, your kernel's network driver, routers out on the internet, ...), and everyone suddenly realized back in 2009 that oops no-one was paying attention and all of them are broken in the same way. Since then there's been a concerted effort to fix these, and things have been getting a lot better. (Mostly irrelevant but super interesting tangent: just last month Google released a major rework of TCP's flow control algorithms that they're hoping to get deployed everywhere to fix a lot of these problems.)
TCP_NOTSENT_LOWAT is aimed at addressing a particular one of these buffers: the per-socket send buffer. The problem with this buffer is that it actually serves two different purposes: it holds data that the application has written to the socket but that is still waiting to be sent on the network, and it also holds data that has already been sent, but that hasn't been acknowledged yet, so the kernel has to hold onto it in case it needs to be resent. And this is a problem because you generally need only a small buffer for the unsent data, and a large buffer for the sent-but-unacknowledged data. (This is because the unsent data buffer just needs to hold enough so that it doesn't run dry between each iteration of the process's event loop, so like, a few milliseconds worth of data at most. The unacknowledged data buffer OTOH needs to hold at least one round-trip-time worth of data, maybe more if conditions are bad, so hundreds or even thousands of milliseconds worth of data. Totally totally different things.) But since the kernel doesn't distinguish between these two kinds of data, then traditionally you just get one giant buffer for both, which means your unsent-data ends up spilling over into the space that really should be used for sent-but-unacknowledged data, and if you just keep dumping data into it until it fills up, then now you've got one of those queues where your data will be waiting in line for hundreds of milliseconds for no reason.
Turning on TCP_NOTSENT_LOWAT fixes this: it effectively tells the kernel to keep track of unsent-data separately from sent-but-unacknowledged data, and then you can use an appropriately sized buffer instead of a ludicrously oversized one. (In some experimenting with simple rate-throttled proxy servers over loopback on my laptop, I was getting ~3-5 second latencies, measured as the time from when I sent data from one process to when it was received in another. Over loopback. After turning on TCP_NOTSENT_LOWAT that drops down to milliseconds or better.)
So that's what I mean about TCP_NOTSENT_LOWAT being basically a Good Thing: it's fixing a bug. But, of course, there are plenty of cases where this bug is still a bug, but not one that really matters. If you're not saturating the pipe (think: IRC, or interactive ssh), then queues don't form anyway so this doesn't matter. Or if you're doing a bulk transfer where no-one's paying attention (think: bittorrent) then latency doesn't really matter much (except for the minor issue that each of these buffers can end up wasting a few megabytes of kernel memory for no reason). OTOH an example of where this really matters is HTTP/2.
The big idea of HTTP/2 is that web pages are made out of lots of parts (HTML, CSS, JS, images, ...), and instead of making lots of separate TCP connections to fetch these like in HTTP/1.1, we're going to make a single connection and then multiplex all the downloads over that one connection. But not all of these parts are created equally -- browsers go to a lot of trouble to try and fetch the important parts of the page first, because this has a direct effect on perceived web page speed. (If a web page shows you the main content after 200 ms then you don't care if it's still loading some images down at the bottom of the page; but if it loads those images first before the rest of the page then that's terrible.) So to handle this, HTTP/2 has a sophisticated system for prioritizing which resources get sent first. The way this ends up working is, each time the socket becomes writeable, you look around and find the highest priority resource that's ready to transmit, and you send a chunk of that. But once you've passed it off to the kernel, then you're committed. So you want to delay this decision as long as possible, because otherwise you risk committing to sending some low priority resource that's ready early, and then a high priority resource becomes ready but it's too late to take that back. In the supermarket analogy again, imagine that they put new items on sale at random times, but once you get in line you can't change what's in your cart. If there's a long line, you run the risk that things you want to buy come on sale while you're waiting in line and it's too late to get them; if the line is short, then you can wait until right before you check out before grabbing what you want.
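As a rough sketch of the "decide as late as possible" loop: each time the socket is reported writeable, ask a priority queue for one chunk of the most important resource that is ready right now. This is heavily simplified (real HTTP/2 prioritization is more involved than a flat priority number), and PriorityChunker and its methods are hypothetical names, not part of any HTTP/2 library.

```python
import heapq


class PriorityChunker:
    """Hand out one chunk at a time from the highest-priority ready resource.

    Lower priority numbers are more urgent.
    """

    def __init__(self, chunk_size=16 * 1024):
        self.chunk_size = chunk_size
        self._heap = []
        self._counter = 0  # tie-breaker so equal priorities stay FIFO

    def add(self, priority, name, payload):
        heapq.heappush(self._heap, (priority, self._counter, name, payload))
        self._counter += 1

    def next_chunk(self):
        """Return (name, chunk) for the most urgent resource, or None."""
        if not self._heap:
            return None
        priority, order, name, payload = heapq.heappop(self._heap)
        chunk, rest = payload[:self.chunk_size], payload[self.chunk_size:]
        if rest:
            # Unsent remainder goes back in the queue at the same priority.
            heapq.heappush(self._heap, (priority, order, name, rest))
        return name, chunk
```

Because only one chunk is committed per writeable notification, a high-priority resource that becomes ready a moment later can still jump ahead of the rest of a low-priority one. The smaller the kernel-side backlog, the more often this decision gets revisited.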
Here's a war story of this bug biting Google Maps. (And this also explains why Google is throwing engineers at fixing TCP in general -- they care a lot about HTTP/2.)
Another example where you could see this kind of thing is if you use ssh in persistent connection mode, then you can end up with a single ssh connection that's multiplexing a bulk file transfer with scp and a regular interactive terminal at the same time. If the bulk file transfer fills up your buffer, then now all your interactive key strokes have to wait in that queue, and it can basically become unusable.
One option would be to add a new socket method for explicit write waiting if you needed it. [...] Another option might be to have Curio intercept setsockopt() and look for TCP_NOTSENT_LOWAT. Based on that, it could enable the extra wait implicitly.
I guess there are two use cases:
- Programs whose authors understand these issues, have thought about them, and are intentionally taking the trouble to enable TCP_NOTSENT_LOWAT.
- Programs where their authors didn't think about these things, but then they run on a system where TCP_NOTSENT_LOWAT has been globally enabled by the sysadmin. (Possibly because the sysadmin tracked down some operational problem caused by the program filling up buffers and causing latency spikes, and the sysadmin would rather fix this by flipping a switch in /proc/sys instead of modifying the source code to third-party libraries. I also wouldn't be surprised if we saw TCP_NOTSENT_LOWAT being enabled more commonly in general as people realize what it does, though this isn't happening yet.)
I think the proposal of enabling it for some sockets as an extra step works fine for the first group (they're already used to jumping through hoops to set up sockets). So that would certainly be a step forward.
It Would Be Nice(tm) if both cases just worked.
Unfortunately, though, I just checked and it doesn't look like there's any reliable way to query a socket to find out whether TCP_NOTSENT_LOWAT is enabled (at least as of Linux 4.6.0) :-(. You can do getsockopt(fd, TCP_NOTSENT_LOWAT, ...) and it works in the sense that if setsockopt was called before then it will return the same number; but if the sysadmin set a global value, then getsockopt won't tell you that. (Internally, the logic seems to be: each socket has a field storing the TCP_NOTSENT_LOWAT value; if this field is set to 0, then it falls back on the global setting. getsockopt/setsockopt only manipulate the per-socket setting.) This also means that if the sysadmin fiddles with the global setting while a program is running, then the new setting will get picked up immediately, even by existing sockets. But of course there's no notification to tell curio or whoever that it needs to re-check.
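A sketch of what a program could do about this on Linux: read the per-socket value with getsockopt and, separately, the system-wide knob from the /proc file used in the experiment earlier in this thread. The function name is hypothetical, the sketch assumes a Python build that exports socket.TCP_NOTSENT_LOWAT, and both lookups are best-effort, returning None when unavailable.

```python
import socket

# Linux-only path; the same file written to in the experiment above.
GLOBAL_KNOB = "/proc/sys/net/ipv4/tcp_notsent_lowat"


def notsent_lowat_settings(sock):
    """Return (per_socket, system_wide); either may be None if unknown."""
    per_socket = None
    opt = getattr(socket, "TCP_NOTSENT_LOWAT", None)
    if opt is not None:
        try:
            # Only reflects values set with setsockopt on this socket.
            per_socket = sock.getsockopt(socket.IPPROTO_TCP, opt)
        except OSError:
            per_socket = None

    system_wide = None
    try:
        with open(GLOBAL_KNOB) as f:
            system_wide = int(f.read().strip())
    except (OSError, ValueError):
        system_wide = None

    return per_socket, system_wide
```

This still cannot notice the sysadmin changing the global value later, so it would have to be re-read whenever the answer matters.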
This presentation from Apple (slides -- starting page 66, video / transcript) has some useful further information. I'll summarize for posterity:
They point out another example of where better-controlling the socket send buffer is critically important: applications that can dynamically adjust the data they send to match the network. E.g., streaming video where if you're short on bandwidth you can lower the video quality, or something like VNC where you can drop the refresh rate to preserve interactivity. (Their demo is Apple Screen Sharing having 3 seconds of lag on a connection with a 35 ms ping.) xpra is a Python remote display app that works like this. Basically the core loop goes like:
while True:
    await sock.writeable()
    # at the last possible moment, take a screenshot
    data = take_screenshot()
    await sock.sendall(data)
(Notice: an app like this actually needs sock.writeable available as a primitive even if it isn't using TCP_NOTSENT_LOWAT.)
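For comparison, here is a sketch of the same loop built on asyncio rather than Curio, since asyncio has no writeable() primitive of its own. wait_writable and frame_loop are hypothetical names; the sketch assumes a Unix selector-based event loop (loop.add_writer is not available on every loop implementation) and a non-blocking socket.

```python
import asyncio
import socket


async def wait_writable(sock):
    """Suspend until the event loop reports sock as writable."""
    loop = asyncio.get_running_loop()
    fut = loop.create_future()
    fd = sock.fileno()

    def on_writable():
        if not fut.done():
            fut.set_result(None)

    loop.add_writer(fd, on_writable)
    try:
        await fut
    finally:
        # Always unregister, even if the wait is cancelled.
        loop.remove_writer(fd)


async def frame_loop(sock, make_frame, n_frames):
    """Produce each frame only once the socket is ready to take it."""
    loop = asyncio.get_running_loop()
    for _ in range(n_frames):
        await wait_writable(sock)
        data = make_frame()  # generated at the last possible moment
        await loop.sock_sendall(sock, data)
```

The important part is the ordering: the frame is generated after the writability wait, so it is as fresh as possible when it enters the kernel buffer.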
What's interesting about this example is it explains why the designers decided to make select and send disagree about whether the socket is writeable: if your proper low water mark is like 8 KiB (plausible), but your protocol means that it's meaningless to send chunks smaller than, say, 100 KiB (VNC sending half-a-screenshot doesn't necessarily make sense), then you don't want to break that 100 KiB chunk into 8 KiB pieces and loop back and forth through the kernel/userspace/scheduler dribbling them out. What you want is to wait until the send buffer is almost empty (with select or whatever), then enqueue one full chunk, then go to sleep until the send buffer is almost empty again.
And finally, they argue that TCP_NOTSENT_LOWAT is basically always a win, it's just a question of how much. In fact, I'll go ahead and quote:
And we started making slides for this presentation where we had two columns. We had the apps that should use the low-water mark option and the apps that shouldn't. And we couldn't think of any to go in the shouldn't column. Every time we thought of a traditional app like file transfer, well, that doesn't need it. We realized you've had that experience where you change your mind about a file transfer and you press Control-C and it seems to take about 30 seconds to cancel. It's because it had over-committed all of this data into the kernel and it had to wait for it to drain because there's no way to change your mind. So, yeah, actually, file transfer does not benefit from over-committing data, and we couldn't think of any application that does benefit from over-stuffing the kernel.
So once we had that realization, we decided starting in the next seed [= what Apple calls a beta release], this option will be turned on automatically for all connections using the higher layer NSURLSession and CFNetwork APIs. All you have to do to make best use of this is when your socket becomes writable, don't loop writing as much data as you can until you get an EWOULDBLOCK. Just write a sensible-sized unit, and then wait to be told it's time for more.
That all makes perfect sense to me, so I've changed my mind :-). Methods like send and sendall should not check for writeability, and there should be some public API like await sock.writeable(). Also, it probably would be a nice extra bonus to automatically enable TCP_NOTSENT_LOWAT whenever possible (i.e. recentish Linux / OS X), with some reasonable default buffer size like 8 or 16 KiB.
I'll update the issue title to match.
Wow! Thanks for writing this up. This is really interesting. If anything, it reconfirms my view that Curio shouldn't be doing its own buffering. That writeable() method can definitely be added (I'm thinking the select() approach will be much more efficient, but will need to experiment).
Adding an await _write_wait() introduces about an 85% performance penalty on the Curio echo-server benchmark.
but is that realistic for typical network servers/clients?
I've added a socket.writeable() method that waits until a socket is writeable. It uses select(). The performance impact of it seems almost negligible when used. In light of earlier discussion, it seems that it should still be separate from send() though.
On further investigation it turns out that my original statements were actually wrong/incomplete. On Linux, TCP_NOTSENT_LOWAT does actually affect select and send and all their friends equally. Even though I gave an example in my first message that seemed to demonstrate that this was not true. According to my improved (but no doubt still imperfect) understanding, I believe that what happened is:
- The SIOCOUTQ ioctl I used to check the buffer size is documented to tell us how much unsent data there is in the buffer. But it doesn't, it actually tells us how much unsent+unacknowledged data there is. So when I was able to write to a buffer that had 131366 bytes in it despite having TCP_NOTSENT_LOWAT set to 128000, that's because the 131366 included some unacknowledged data that TCP_NOTSENT_LOWAT was ignoring. (I filed a bug to hopefully get the man page fixed, at least.)
- There is one case where select and send disagree. It has nothing to do with TCP_NOTSENT_LOWAT, it just happened by accident to kick in at exactly the same time that TCP_NOTSENT_LOWAT would have kicked in if SIOCOUTQ had been telling the truth. The situation where they disagree is if the buffer is between 66% and 100% full: then select says it's not writeable, but send will still succeed. (The point of this is to avoid waking up processes over and over every time the buffer drops to 99% full.) This is sort of irrelevant to us, just an interesting/confusing fact.
However! AFAICT, on macOS, it is actually true that TCP_NOTSENT_LOWAT affects select-and-friends, but does not affect send-and-friends. This is based on source code reading, not actual testing, so I could be wrong, but I think it's true. It's also a total coincidence that my misunderstanding of Linux turned out to be correct for macOS, since I didn't look at macOS at all when first investigating this...
Anyway, none of this really affects any of the conclusions here, I just wanted to correct that misinformation in case someone stumbles across this thread in the future.
Closing this for now.
The case where select and send disagree hints at the one time where TCP_NOTSENT_LOWAT could hurt -- if the application already has the data, and sending it to the kernel now will let the app do some cleanup (or exit, or further progress while the CPU is otherwise idle). That said, using asynchronous output for an app like this is enough of a corner case that TCP_NOTSENT_LOWAT should probably be the default.