corretto/corretto-11

Segfault in libnet.so at high number of open connections

JohnMurray opened this issue · 4 comments

Describe the bug

We observed a segfault in the JVM on an Apache Trino server running at a high connection count. This appears to happen while opening a new network connection (see the attached log for the specific stack trace).

To Reproduce

The service is the OSS version of Trino (version 358). We have observed the issue when the server reaches approximately 186k connections. It has been repeatable under live traffic, but we have not attempted to reproduce it with a synthetic load in a non-production environment.

Expected behavior

To the best of our knowledge, no system limitations were encountered (e.g. fd limits). Thus the expectation is that the JVM should be able to continue opening new connections. It is also our expectation that failures to open new connections should be signaled through exceptions that application code can respond to.

Screenshots

n/a

Platform information

OS Information

Ubuntu 20.04.6
uname: 5.15.0-1038-aws #43~20.04.1-Ubuntu SMP Fri Jun 2 17:11:42 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux

Version Information

openjdk version "11.0.19" 2023-04-18 LTS
OpenJDK Runtime Environment Corretto-11.0.19.7.1 (build 11.0.19+7-LTS)
OpenJDK 64-Bit Server VM Corretto-11.0.19.7.1 (build 11.0.19+7-LTS, mixed mode)

Additional context

hs_err_pid1290261.log

Thanks for reporting this bug. We're investigating it and will update this issue when we know more.

Judging from si_addr:

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x000000000000219d

...the failing call stack:

Stack: [0x0000ff898a3fe000,0x0000ff898a5fe000],  sp=0x0000ff898a5f0010,  free space=1992k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libpthread.so.0+0x9c00]  pthread_mutex_lock+0x10
C  [libnet.so+0xfc18]  NET_Poll+0x74
C  [libnet.so+0xcd74]  Java_java_net_PlainSocketImpl_socketConnect+0x1e4
J 44378  java.net.PlainSocketImpl.socketConnect(Ljava/net/InetAddress;II)V java.base@11.0.19 (0 bytes) @ 0x0000ffff802e61cc [0x0000ffff802e6140+0x000000000000008c]
C  0x0000000000000001

...and the code in NET_Poll in 11u that ends up in this macro:

/*
 * Macro to perform a blocking IO operation. Restarts
 * automatically if interrupted by signal (other than
 * our wakeup signal)
 */
#define BLOCKING_IO_RETURN_INT(FD, FUNC) {      \
    int ret;                                    \
    threadEntry_t self;                         \
    fdEntry_t *fdEntry = getFdEntry(FD);        \
    if (fdEntry == NULL) {                      \
        errno = EBADF;                          \
        return -1;                              \
    }                                           \
    do {                                        \
        startOp(fdEntry, &self);                \
        ret = FUNC;                             \
        endOp(fdEntry, &self);                  \
    } while (ret == -1 && errno == EINTR);      \
    return ret;                                 \
}

...I suspect that getFdEntry returned a pointer at or near NULL, and so we crashed trying to lock the mutex inside that bogus entry.
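
For context, here is a condensed sketch of the bookkeeping that macro relies on, based on my reading of OpenJDK 11's linux_close.c in libnet (fields and error handling simplified, so treat it as illustrative rather than the verbatim source):

#include <pthread.h>

/* Condensed from OpenJDK 11's linux_close.c (libnet); simplified,
 * not the exact source. */
typedef struct threadEntry {
    pthread_t           thr;    /* this thread                     */
    struct threadEntry *next;   /* next thread blocked on this fd  */
    int                 intr;   /* interrupted by a wakeup signal? */
} threadEntry_t;

typedef struct {
    pthread_mutex_t lock;       /* first field: &fdEntry->lock == fdEntry */
    threadEntry_t  *threads;    /* threads currently blocked on this fd   */
} fdEntry_t;

static void startOp(fdEntry_t *fdEntry, threadEntry_t *self)
{
    self->thr  = pthread_self();
    self->intr = 0;

    /* The macro only rejects fdEntry == NULL, so a small non-NULL
     * garbage pointer (like 0x218d) sails through and faults here,
     * inside pthread_mutex_lock, matching the reported stack. */
    pthread_mutex_lock(&(fdEntry->lock));
    self->next = fdEntry->threads;
    fdEntry->threads = self;
    pthread_mutex_unlock(&(fdEntry->lock));
}

Note that lock is the first field of fdEntry_t, so the bogus mutex address handed to pthread_mutex_lock would be the bogus fdEntry pointer itself.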

I see no relevant bug reports that are not fixed in 11u. The old socket implementation that uses this code was removed in JDK 18 in favor of the new socket implementation introduced by https://openjdk.org/jeps/353. So if there is a fix, it would be 11u- and 17u-specific, and the issue would likely only reproduce on 17u with -Djdk.net.usePlainSocketImpl.

Can you upgrade to JDK 17u? If not (I see the crash report is from Jenkins), maybe you can run with fastdebug binaries to get more verbose diagnostics for the issue?

I agree with @shipilev. From the hs_err file we see that we are indeed at the beginning of pthread_mutex_lock, dereferencing the __data.__kind field of the pthread_mutex_t struct, which is at offset 0x10. The mutex argument is passed in register x0 (0x000000000000218d) and is clearly wrong (i.e. too small):

pthread_mutex_lock(pthread_mutex_t *mutex)
0x0000ffffbb38fbf0:  FD 7B BD A9    stp  x29, x30, [sp, #-0x30]!
0x0000ffffbb38fbf4:  FD 03 00 91    mov  x29, sp
0x0000ffffbb38fbf8:  F5 5B 02 A9    stp  x21, x22, [sp, #0x20]
0x0000ffffbb38fbfc:  15 40 00 91    add  x21, x0, #0x10          // mutex = R0 = 0x000000000000218d
0x0000ffffbb38fc00:  A2 02 40 B9    ldr  w2, [x21]
0x0000ffffbb38fc04:  E1 2F 80 52    movz w1, #0x17f
0x0000ffffbb38fc08:  41 00 01 0A    and  w1, w2, w1
0x0000ffffbb38fc0c:  1F 20 03 D5    nop

So we finally end up accessing 0x000000000000218d + 0x10 (which is 0x000000000000219d) and crash.
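
For reference, the offset comes from glibc's internal mutex layout, sketched below (struct __pthread_mutex_s, the __data member of pthread_mutex_t; trailing fields elided, offsets as they land on this 64-bit target):

/* Sketch of glibc's struct __pthread_mutex_s; remaining fields elided. */
struct __pthread_mutex_s {
    int          __lock;    /* offset 0x00 */
    unsigned int __count;   /* offset 0x04 */
    int          __owner;   /* offset 0x08 */
    unsigned int __nusers;  /* offset 0x0c */
    int          __kind;    /* offset 0x10: the `ldr w2, [x21]` above
                               reads this field and faults */
    /* ... */
};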

It is unclear why this happens, but it seems like the internal fdTable data was corrupted.
From the hs_err file I see that your limit on open file descriptors is set to 262144, and because your issue always seems to appear after more than 186k connections, this might be related to the overflow table implementation.
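
For readers following along, this is roughly what the lookup looks like, condensed from getFdEntry in the 11u sources (constants and the fatal-error handling are simplified, so treat the details as an approximation). Descriptors below fdTableMaxSize use a flat base table; everything above it, which is the territory a process with 186k+ connections lives in, goes through a lazily allocated two-level overflow table:

#include <pthread.h>
#include <stdlib.h>

/* Types and globals as in the earlier sketch; sizes approximate. */
typedef struct { pthread_mutex_t lock; void *threads; } fdEntry_t;

static const int       fdTableMaxSize = 0x1000;      /* 4096 in 11u */
static const int       fdOverflowTableSlabSize = 0x10000;
static fdEntry_t      *fdTable;          /* allocated at libnet init   */
static fdEntry_t     **fdOverflowTable;  /* root of the overflow table */
static pthread_mutex_t fdOverflowTableLock = PTHREAD_MUTEX_INITIALIZER;

static fdEntry_t *getFdEntry(int fd)
{
    if (fd < 0) {
        return NULL;
    }
    if (fd < fdTableMaxSize) {
        /* Common case: flat base table indexed directly by fd. */
        return fdTable + fd;
    }
    /* High fds index a two-level table: the root index picks a slab,
     * the slab index picks the entry within it. */
    const int rootindex = (fd - fdTableMaxSize) / fdOverflowTableSlabSize;
    const int slabindex = (fd - fdTableMaxSize) % fdOverflowTableSlabSize;

    pthread_mutex_lock(&fdOverflowTableLock);
    if (fdOverflowTable[rootindex] == NULL) {
        /* First fd in this range: allocate and publish the slab
         * (the real code treats allocation failure as fatal). */
        fdOverflowTable[rootindex] =
            (fdEntry_t *)calloc(fdOverflowTableSlabSize, sizeof(fdEntry_t));
    }
    pthread_mutex_unlock(&fdOverflowTableLock);

    /* Corruption of the root table or a slab pointer here would hand
     * the caller exactly the kind of bogus fdEntry_t* seen in x0. */
    return fdOverflowTable[rootindex] + slabindex;
}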

As suggested by @shipilev, can you please try to run with a fastdebug build, as that has assertions enabled and might provide some more diagnostics? You can get a fastdebug build of Corretto 11 from here.

Thank you @simonis and @shipilev for the detailed response! ❤️ We've been unable to reproduce the issue, so I unfortunately don't have more information at the moment. For now, knowing that this seems to be FD related is very helpful.

We will be moving to Java 17 in the medium term. So if we see the issue recur in our production environment, or if it recurs with Java 17, I'll open a new issue with more detailed information from a fastdebug build.