microsoft/WSL

Intermittent (repeatable) hang sending/receiving a large buffer over TCP

aseering opened this issue · 23 comments

This is tested against build 14361. It might be an underlying cause of #472 ; I suspect it's also the root cause of an issue with some closed-source Linux software that I'm playing with.

Reproducer:

  • Open up a Bash window. Save the following Python script to a file; run the script:
import socket

# Code based heavily on examples from https://docs.python.org/2/howto/sockets.html

serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
serversocket.bind(("localhost", 8088))
serversocket.listen(5)

big_data = "X" + "_"*3000000 + "X"

while 1:
    (clientsocket, address) = serversocket.accept()
    clientsocket.send(big_data)
    clientsocket.close()
  • Open a second Bash window. Run the following shell command/loop:
while nc localhost 8088 > tmp.txt; do date; sleep 1; done

Expected / "what happens on a real Ubuntu system":

  • The shell loop will loop forever, printing the date once per second.

Actual / "what happens in WSL":

  • The shell loop consistently hangs, typically after 10 to 20 iterations.
  • It seems to hang sooner if I remove the sleep 1 from the shell loop and/or if I make the big_data constant in the Python script larger.

If I strace the Python process, it looks like this:

(...)
accept(3, {sa_family=AF_INET, sin_port=htons(46046), sin_addr=inet_addr("127.0.0.1")}, [16]) = 4
sendto(4, "X_______________________________"..., 3000002, 0, NULL, 0) = 3000002
close(4)                                = 0
accept(3, {sa_family=AF_INET, sin_port=htons(46050), sin_addr=inet_addr("127.0.0.1")}, [16]) = 4
sendto(4, "X_______________________________"..., 3000002, 0, NULL, 0) = 3000002
close(4)                                = 0
accept(3, {sa_family=AF_INET, sin_port=htons(46054), sin_addr=inet_addr("127.0.0.1")}, [16]) = 4
sendto(4, "X_______________________________"..., 3000002, 0, NULL, 0) = 3000002
close(4)                                = 0
accept(3,
## HANG HERE ##

If I strace the nc process, I no longer experience the hang. If I attach strace to an nc that has already hung, I get no output; but when I detach, nc exits with an error code (stopping the loop) and prints the following error text:

nc: Polling Error: Interrupted system call

For reference, an strace of a non-hanging nc looks like a lot of:

(...)
read(3, "________________________________"..., 2048) = 2048
write(1, "________________________________"..., 2048) = 2048
poll([{fd=3, events=POLLIN}, {fd=0, events=POLLIN}], 2, 4294967295) = 1 ([{fd=3, revents=POLLIN}])
read(3, "________________________________"..., 2048) = 2048
write(1, "________________________________"..., 2048) = 2048
poll([{fd=3, events=POLLIN}, {fd=0, events=POLLIN}], 2, 4294967295) = 1 ([{fd=3, revents=POLLIN}])
read(3, "________________________________"..., 2048) = 2048
write(1, "________________________________"..., 2048) = 2048
poll([{fd=3, events=POLLIN}, {fd=0, events=POLLIN}], 2, 4294967295) = 1 ([{fd=3, revents=POLLIN}])
read(3, "________________________________"..., 2048) = 2048
write(1, "________________________________"..., 2048) = 2048
poll([{fd=3, events=POLLIN}, {fd=0, events=POLLIN}], 2, 4294967295) = 1 ([{fd=3, revents=POLLIN}])
read(3, "________________________________"..., 2048) = 2048
(...)

Interestingly, when the nc process is hung, I would expect its output file tmp.txt to be either 0 or 3000002 bytes, but instead I see:

$ stat --printf="%s\n" tmp.txt
2476032

The exact number is not always the same, but it is pretty consistently right around 2.4 MB. I think that has to do with timing, as I have a more complicated (and closed-source) repro that I have seen hang at around 0.5 MB.

Note that the file size is a multiple of 2048, nc's read block size (2476032 = 2048 × 1209). I've checked a couple of times, and so far this appears to be repeatable. This, combined with the error message that nc prints when strace is attached to a hung process, supports the hypothesis that it's hanging in poll().

For what it's worth, I have also (on build 14361) seen a hang when downloading large files from a WSL service to a Windows application over localhost. The above hang does not reproduce if I use Windows netcat instead of Linux netcat, so I don't know if it's the same issue. I'll try to come up with a separate reproducer.

Qix- commented

FWIW, while this is a bug (and Python probably handles this for you), the system's internal socket buffers are probably not big enough to hold 3,000,000 bytes at once. Even OS X's default (at the least) isn't:

#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void) {
    int n;
    socklen_t m = sizeof(n);
    /* Any socket will do for querying the default buffer size. */
    int fdsocket = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    /* Report the kernel's default receive-buffer size, in bytes. */
    getsockopt(fdsocket, SOL_SOCKET, SO_RCVBUF, (void *)&n, &m);
    printf("%d\n", n);
    return 0;
}
$ clang socket.c
$ ./a.out
196724

Though I would imagine Python handles these cases for you.
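
For reference, the same check from Python would look roughly like this -- a minimal sketch using the standard socket module constants:

import socket

# Query the kernel's default socket buffer sizes, in bytes.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print("SO_RCVBUF: %d" % s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
print("SO_SNDBUF: %d" % s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
s.close()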

Have you tried using netcat to download something over an actual NIC rather than the loopback device (localhost)? That might help narrow down the bug a bit more.

@Qix- -- I'm not sure what cases Python would need to handle? TCP is a streaming protocol. In the typical case (unless the sends are so small and infrequent that Nagle's algorithm can't coalesce them), it's impossible for the recipient to tell the difference between a single send of 3 MB, three sends of 1 MB, 3,000 sends of 1 kB, etc. Over a physical network, the big sent buffer gets broken up into small packets, which are streamed to the recipient as they arrive. If too many arrive and the recipient can't buffer them, they queue on the sender, which still has the original (in this case) ~3 MB buffer in memory, because a large sendto() syscall doesn't return until the recipient has consumed most of the data. All the recipient strictly needs is a buffer big enough to hold one packet from the underlying network -- a size usually measured in kilobytes. (For the loopback device, you arguably don't even need that -- the recipient could be handed blocks of data directly out of the buffer that the sender passed to the kernel; no additional buffering required.)
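
To make the point concrete, here's a sketch with two hypothetical sender helpers (not part of the repro) that put byte-for-byte identical streams on the wire; the receiver cannot tell which one was used:

import socket

DATA = b"X" + b"_" * 3000000 + b"X"

def send_one_big(sock):
    # One large send; the kernel and TCP stack do the chunking.
    sock.sendall(DATA)

def send_many_small(sock, chunk=1024):
    # Thousands of small sends; the byte stream the receiver sees is identical.
    for i in range(0, len(DATA), chunk):
        sock.sendall(DATA[i:i + chunk])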

(The SO_RCVBUF buffer that you're referring to is actually a performance optimization. Typical userspace applications alternate between reading data and processing data. The kernel could accept just enough data to fill the buffer passed into read() by the userspace application, then tell the sender to wait until read() is called again. But, as it turns out, repeated starts and stops of the data stream wreak havoc with TCP's default congestion-control algorithm: TCP would get confused and throttle itself to a very slow speed, thinking (incorrectly) that the network is overloaded. The receiving Linux kernel avoids this problem by continuing to (pre)fetch data from the sender while the recipient process is still busy processing and not yet ready for the next block, so that the rate of data consumption stays more stable. That's what that buffer is used for.)

Regardless, the behavior over a physical network is a good question. I'm not seeing this when running Python on a real Linux machine and nc within WSL. (Though I might just need to tune the buffers to be larger or adjust some other parameter -- it is, after all, not deterministic.)

I'm also not seeing this with Python running under WSL and nc on a real Linux machine. Though I'll let both run for a few hours just to see.
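
If it does come down to buffer tuning, requesting larger buffers in the reproducer would be straightforward -- a sketch (the 4 MB figure is arbitrary, and the kernel treats these values as hints it may round or clamp):

import socket

# Sketch: ask the kernel for larger socket buffers before connecting/binding.
BUF_SIZE = 4 * 1024 * 1024  # arbitrary 4 MB, just for experimentation

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_SIZE)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_SIZE)
print("SO_SNDBUF now: %d" % s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
print("SO_RCVBUF now: %d" % s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))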

Also -- please disregard my earlier comment about downloading from a WSL service to Windows. The hang was actually on a WSL->WSL socket. "It's complicated" :-) If I do find a real issue shipping data between WSL and Windows, I'll post a reproducer.

Qix- commented

I was referring to chunking a large buffer into pieces the actual send() syscall can handle. If you're writing more data than the kernel's I/O buffers can handle (as I showed in the small C program) then you're either going to block or your syscalls are going to fail.

My point was that Python was probably doing the logic of splitting up an otherwise large buffer and syscalling send() with smaller chunks under the hood for you.
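
Something along these lines is what I had in mind -- a sketch (the send_in_chunks name is made up; Python's own sendall() effectively does this loop for you):

def send_in_chunks(sock, data, chunk_size=64 * 1024):
    # Hand the kernel one modest chunk at a time, honoring the byte
    # count that send() actually accepted (it may be a partial write).
    sent = 0
    while sent < len(data):
        sent += sock.send(data[sent:sent + chunk_size])

# e.g. send_in_chunks(clientsocket, big_data) instead of clientsocket.send(big_data)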

@Qix- -- sorry if I was unclear -- the actual send() syscall does not require chunking. I was attempting to both explain why it does not require chunking, and how you can see from my straces above that Python is not performing chunking.

@aseering - Thanks for the reliable repro and detailed analysis. We were able to track the bug down. In the WSL socket TCP code, in some cases we were not waking up a blocked receive thread so that it could continue and return 0 once the connection was reset. The reason you were seeing accept hang is that a receive thread was hung on the nc side; once that receive returned, nc would connect again, but on WSL the recv never returned. It's quite possible that this bug is also causing apt-get to hang, but that's something I'm not able to verify because I don't have a repro for it.

Thanks @sunilmut for the fix! Interesting -- not quite the cause that I had guessed but it makes perfect sense. I look forward to being able to try it out.
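
Once a build with the fix shows up, something like this should make the expected behavior easy to check -- a sketch of the client side, with an arbitrary 30-second timeout as a crude hang detector:

import socket

c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
c.settimeout(30)             # arbitrary; only here to turn a hang into an error
c.connect(("localhost", 8088))

received = 0
try:
    while True:
        chunk = c.recv(2048)
        if not chunk:        # recv() returning 0 bytes is the wakeup at issue
            break
        received += len(chunk)
    print("received %d bytes" % received)
except socket.timeout:
    print("recv() never woke up after the server closed the connection")
finally:
    c.close()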

I'm still seeing this in build 14366. I assume the fix didn't make it into this build?

It doesn't look like it judging by the release notes.

@aseering - That's correct. The fix is still under review, and we are trying to get it into the Anniversary update, but the quality gates are becoming tighter. We'll try to squeeze it in.

Thanks @benhillis and @sunilmut! I know how release stabilization goes -- good luck with the release! If you're able to fit this in, I would really appreciate it; I think I've been hitting it in various different use cases for quite some time now.

Looks like the fix didn't make it into r14371 either. It's not explicitly called out in the release notes (though there's a "miscellaneous" entry at the end) and I'm still seeing the hang.

@aseering - Update: The fix for this bug was approved for the Anniversary update and is making its way up to the release branch. It should land anytime now.

@sunilmut -- thanks for the update! I really appreciate it; this fix will be very helpful to me. Hopefully also for other people who (for example) want to run automated tests of their Web app or other code that stresses the socket layer within WSL.

Hm... I'm happy to report that the simplified reproducer script I posted above now seems to work much more reliably. Unfortunately, the original app from which I derived the reproducer still hangs, apparently during the same socket send. I'll investigate, and will update this bug and/or open a new one.

@sunilmut -- have you (or has anyone else) ever seen WSL's sendmsg() return EAGAIN on a blocking socket (no MSG_DONTWAIT and no O_NONBLOCK)? I believe I'm seeing this, but I haven't yet managed to reproduce it concisely.

Anyway, I think the reproducer above has been fixed as of build 14376.

Qix- commented

Do you have a zero timeout set on the socket? Also, it appears EAGAIN is a valid return even on blocking sockets, indicating that something interrupted the blocking call and that the call should be reissued. TIL.

How frequent is it? It should be innocuous for properly coded applications, though if it's frequent it'd definitely be a performance issue.

@Qix- -- I don't see any syscalls modifying the socket timeout. The EAGAIN is very infrequent; in fact, I see it exactly once, right before my writer hangs and stops writing to its socket.

I don't know that the EAGAIN is necessarily causing the problem; it might just be another symptom. Looking more carefully, the code probably should be handling EAGAINs correctly; I'd need the fix for #204 to verify that with greater confidence, though.
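
For concreteness, the kind of defensive send loop I have in mind looks roughly like this -- a sketch (the robust_send_all name is made up; in Python, EAGAIN on a send surfaces as a socket.error/OSError with errno EAGAIN):

import errno
import select
import socket

def robust_send_all(sock, data):
    # Keep sending until the whole buffer is written, retrying if the
    # call is interrupted (EINTR) or spuriously reports EAGAIN.
    view = memoryview(data)
    sent = 0
    while sent < len(data):
        try:
            sent += sock.send(view[sent:])
        except socket.error as e:
            if e.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
                # Wait until the socket is writable again, then retry.
                select.select([], [sock], [])
            elif e.errno == errno.EINTR:
                continue
            else:
                raise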

I'm still working on reproducing my latest issue. #610 is the closest I've gotten so far, though it's not exactly the same. #611 also has some similarities. I'll keep poking around to see what I can find.

In the meantime, I think it would be fine to close this bug if y'all are so inclined. Any new issues that I run into at this point should probably be filed as new bugs.

#616 is a pretty good, tight reproducer of what I'm now seeing. This issue describes something similar, but it's clearly not the same, because this issue has been fixed :-) I'm going to go ahead and close this ticket.

Thanks again for all the hard work that y'all are doing!