Crashes with TLS defined
Closed this issue · 8 comments
I've had several occasions where a database has crashed with a SIGPIPE (broken pipe). It can stay up for a month or a day, but at some point it crashes. I have TLS defined in options.h, and peer verification is also enabled. Previously, with both disabled, the same source ran a db for months several times without crashing.
The crashes happen when the game decrements the nhandle refcounts in network.cc (I'll get the exact lines as soon as I can). It tries to push output to the nhandle, so it can end up writing to a closed connection and raise SIGPIPE.
The SIGPIPE errors I've seen have all happened in a GDB debugging session. I'm not sure how the server is configured to handle SIGPIPE, i.e. whether it's set to ignore it and GDB is what caused the server to stop. When TLS was not defined the db was running normally, with build type release as opposed to the current one (LeakCheck and GDB).
network.cc, line 319:
count = write(h->wfd, b->start, b->length);
is what caused the SIGPIPE.
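For reference, here is a minimal sketch of what a write() to an already-closed peer does once SIGPIPE is ignored: the call fails with EPIPE instead of killing the process. The helper name write_to_closed_pipe is hypothetical, and a pipe stands in for the client socket; this is not the server's code, and whether the server itself ignores SIGPIPE is not established in this thread. Note also that GDB, as far as I know, stops on SIGPIPE by default even when the program ignores it, which may explain why the stops were only seen under the debugger.

```cpp
#include <cerrno>
#include <csignal>
#include <unistd.h>

// Hypothetical helper (not from network.cc): writes to a pipe whose read end
// is already closed and returns the errno reported by write(), or 0 on success.
int write_to_closed_pipe()
{
    // With SIGPIPE ignored, writing to a closed peer fails with EPIPE
    // instead of terminating the process.
    std::signal(SIGPIPE, SIG_IGN);

    int fds[2];
    if (pipe(fds) != 0)
        return errno;

    close(fds[0]);  // simulate the peer disconnecting before output is pushed

    const char msg[] = "hello\n";
    ssize_t count = write(fds[1], msg, sizeof msg - 1);
    int err = (count < 0) ? errno : 0;

    close(fds[1]);
    return err;
}
```

Calling write_to_closed_pipe() should return EPIPE, which a caller can treat as "connection gone" and clean up the handle rather than crash.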
It tries to push something to the connection after it has already been closed, although I don't quite get how that would happen, given that databases have been running for months previously. I am using Fedora Server 37. It did not seem to happen when TLS was disabled, so I have disabled it again as a temporary measure. It's odd that it would depend on TLS, though, since that line is not related to TLS connections.
This just happened again. Apparently it has something to do with threaded DNS lookups; however, when I set $server_options.no_name_lookup to 1 the server still breaks a pipe at times. Adding lines like:
if (`connection_name(player) ! ANY' == E_INVARG)
return 0;
endif
if (connection_name(player) == "")
return 0;
endif
to $do_login_command to see if it helps.
Weird, still does it. I'm quite lost here. Disabled printing of messages on a listener to see if it has anything to do with it, since it pushes output to the connection when it disconnects if messages are printed.
Current situation: it is still doing it with TLS disabled, threading disabled, and the stock network.cc file from this repo. Running Fedora Server 37 with Linux kernel 6.0.12. Disabled IPv6 just in case; I don't think anyone connecting to the database uses it anyway.
More information found. It happens when two connections, in this case -2307 and -2308, connect at the same time from the same IP. The latter disconnected immediately after connecting (a probe of some kind), and the former would probably have done so too, but before the second disconnection could be logged, the server crashed with a SIGPIPE. Hope this helps.
Closing since proxy rewrite mutex has fixed this.
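The actual mutex change is not shown in this thread, so the following is only a rough sketch of the general idea, under the assumption that the crash came from one code path closing a handle while another was about to write to it. Handle, close_handle and push_output are hypothetical names, not the server's real nhandle API.

```cpp
#include <mutex>
#include <string>
#include <unistd.h>

// Hypothetical stand-in for a network handle; not the server's real nhandle.
struct Handle {
    std::mutex lock;      // serializes close and write on this handle
    int wfd = -1;         // descriptor that output is written to
    bool closed = false;  // set once the descriptor has been closed
};

// Closes the descriptor exactly once, under the lock.
void close_handle(Handle &h)
{
    std::lock_guard<std::mutex> guard(h.lock);
    if (!h.closed) {
        close(h.wfd);
        h.closed = true;
    }
}

// Returns false instead of writing when the handle was already closed locally.
bool push_output(Handle &h, const std::string &text)
{
    std::lock_guard<std::mutex> guard(h.lock);
    if (h.closed)
        return false;
    ssize_t count = write(h.wfd, text.data(), text.size());
    return count == static_cast<ssize_t>(text.size());
}
```

This only guards against a local close racing with a write. A peer that disconnects on its own can still trigger SIGPIPE/EPIPE on write(), so that case needs to be handled separately.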
Yep, works.