eventloop/networking gets stuck on musl
l63l opened this issue ยท 11 comments
Following happens only when building janet with musl libc since 5f2e287 .
When hitting a simple spork/http/server with a load-testing tool like wrk
and about 20 parallel connections, the server stops responding on the
second invocation of wrk
.
And this only happens when using ev/go (which http/server does), but
not with ev/call:
(def a (net/listen "127.0.0.1" "8000"))
(defn handler [con]
(defer (:close con)
(ev/read con 1000)))
# gets stuck:
(ev/go (fn [] (net/accept-loop a handler)))
# doesn't:
# (forever
# (def con (net/accept a))
# (ev/call handler con))
And a client like:
(loop [i :range [0 20]]
(ev/spawn-thread
(forever
(with [con (net/connect "127.0.0.1" "8000" :stream)]
(:write con "AAA")
(def res (:read con 1024))
(print con res)
(:close con)))))
To get a musl build of janet for example on debian:
apt-get install musl-tools
make CC=musl-gcc
Hello. I am using musl build on Alpine, and I have yet to notice the behavior you are describing. I will try your reproduction code and report back. Thanks for reporting.
I have tried it with the latest Janet from the master branch on Alpine Linux with musl at the time of this writing and see no issues like the one you are describing. We will need more information about your setup.
hm, thanks for testing. I tested on void-linux with host musl-libc (1.1.24), alpine 3.17 vm (musl 1.2.3) and debian bullseye with musl-gcc. Now I fetched alpine 3.19.2 and build janet from master and could reproduce it, too.
You need to keep the "server snippet" running, start the "client snippet", kill the "client snippet" and the second invocation will get stuck.
Oh. I did not get the process right. I am trying it again.
Yep. I can confirm that the problem manifests in this case on Alpine.
I'm experiencing similar behavior on Arch with glibc. This is Janet version 1.34.0-f92f3eb6, built from the repo. I don't see the issue on FreeBSD with an equivalent build (from the same commit).
Here's my repro setup:
(var n-req 0)
(defn handler
[stream]
(defer (net/close stream)
(ev/sleep (math/random))
(set n-req (inc n-req))
(printf "%d: %N" n-req (net/peername stream))
(net/write stream (string/format "%3d: Accepted\n" n-req))))
(defn main
[&]
(let [server (net/listen "127.0.0.1" 9000)]
(net/accept-loop server handler)))
In another shell:
for i in $(seq 100)
do nc -d -w 2 127.0.0.1 9000 &
done
Once the accept-loop gets "stuck", the symptom I see is that some (but not all) connections are accepted and sent to the handler and some of those (but not all) result in data actually being written back to the client. The issue seems to accumulate, such that successive batches of 100 requests result in fewer and fewer successful responses.
I tried (forever (ev/call handler (net/accept server)))
and that does seem to work correctly for me as well.
Updated: I see the same behavior with the official build as well.
So took another look at this now that I have repro that works on my system, and the root cause is in how we setup epoll/kqueue in edge trigger mode.
This works well for most cases, but for net/accept-loop
, this can cause connections to be missed in these micro benchmarks where we make many connections back to back.
The fix is conceptual fairly simple, but I would like to make server sockets use edge trigger mode unless we call net/accept-loop, in which they switch to level trigger mode and make sure we fix kqueue too, which seems to have the same wrong logic here.
POC fix that just make all server sockets run in level trigger mode (so not a "real" solution as it would cause a busy loop in the case with (forever (net/accept ...))
diff --git a/src/core/ev.c b/src/core/ev.c
index 335b9738..e373c3f1 100644
--- a/src/core/ev.c
+++ b/src/core/ev.c
@@ -1543,8 +1543,12 @@ static JanetTimestamp ts_now(void) {
/* Wait for the next event */
static void janet_register_stream(JanetStream *stream) {
struct epoll_event ev;
- ev.events = EPOLLET;
- if (stream->flags & (JANET_STREAM_READABLE | JANET_STREAM_ACCEPTABLE)) ev.events |= EPOLLIN;
+ if (stream->flags & JANET_STREAM_ACCEPTABLE) {
+ ev.events = EPOLLIN; /* Do NOT edge trigger server sockets */
+ } else {
+ ev.events = EPOLLET;
+ }
+ if (stream->flags & JANET_STREAM_READABLE) ev.events |= EPOLLIN;
if (stream->flags & JANET_STREAM_WRITABLE) ev.events |= EPOLLOUT;
ev.data.ptr = stream;
int status;
@@ -1695,7 +1699,9 @@ static void timestamp2timespec(struct timespec *t, JanetTimestamp ts) {
void janet_register_stream(JanetStream *stream) {
struct kevent kevs[2];
int length = 0;
- if (stream->flags & (JANET_STREAM_READABLE | JANET_STREAM_ACCEPTABLE)) {
+ if (stream->flags & JANET_STREAM_ACCEPTABLE) {
+ EV_SETx(&kevs[length++], stream->handle, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, stream);
+ } else if (stream->flags & JANET_STREAM_READABLE) {
EV_SETx(&kevs[length++], stream->handle, EVFILT_READ, EV_ADD | EV_ENABLE | EV_CLEAR, 0, 0, stream);
}
if (stream->flags & JANET_STREAM_WRITABLE) {
This appears to be working for me. Thanks!
musl builds are also fixed. Thanks.