mumble-voip/libmumble

Closing TCP connection after SSL_ERROR_SYSCALL

Opened this issue · 6 comments

I've test library with mumble-server based on mumblevoip/mumble-server:v1.4.287

After I have connected (and sent Ping messages to server) I got an SSL_ERROR_SYSCALL in TLS.cpp and then got Shutdown and closed TCP.
You need to be connected a lot of time (from 10 minutes to 10 hours) to reproduce
https://github.com/mumble-voip/libmumble/blob/master/src/TLS.cpp#L184-L190

top of stacktrace:

#0  mumble::SocketTLS::interpretLibCode (this=0x55555566b760, code=0, processed=false, remaining=true) at /PATH/libmumble/src/TLS.cpp:194
#1  0x00007ffff7f72682 in mumble::SocketTLS::read (this=0x55555566b760, buf=...) at /PATH/libmumble/src/TLS.cpp:161
#2  0x00007ffff7f2b21a in mumble::Connection::P::read(gsl::span<std::byte, 18446744073709551615ul>, bool, std::function<bool ()>) (this=0x55555566b760, buf=..., wait=false, halt=...) at /PATH/libmumble/src/Connection.cpp:156
#3  0x00007ffff7f2ac81 in mumble::Connection::process(bool, std::function<bool ()>) (this=0x5555556193c0, wait=false, halt=...) at /PATH/libmumble/src/Connection.cpp:101
#4  0x00007ffff7f591ef in operator() (__closure=0x7ffff0028dc0, event=...) at /PATH/libmumble/src/Peer.cpp:299
#5  0x00007ffff7f59ac1 in operator() (__closure=0x7ffff0028dc0, i=0) at /PATH/libmumble/3rdparty/quickpool/quickpool.hpp:928
#6  0x00007ffff7f5a35b in quickpool::loop::Worker<quickpool::ThreadPool::parallel_for_each<gsl::span<mumble::Monitor::Event>, 

I check ERR_get_error() is 0 and errno is 11 inside SSL_ERROR_SYSCALL case

auto sslError = ERR_get_error();
auto err = errno;

After my little patch a have no random disconnect anymore

diff --git a/src/TLS.cpp b/src/TLS.cpp
index 8ea8176..136f4db 100644
--- a/src/TLS.cpp
+++ b/src/TLS.cpp
@@ -182,6 +182,9 @@ uint32_t SocketTLS::pending() const {
 		case SSL_ERROR_WANT_WRITE:
 			return WaitOut;
 		case SSL_ERROR_SYSCALL:
+			if (ERR_get_error() == 0 && errno == 11) {
+				return Retry;
+			}
 			m_closed = true;
 
 			if (!processed) {

But I'm not sure this is enough and this is idiomatic code for handling ssl problems

Related links from so:
https://stackoverflow.com/questions/13686398/ssl-read-failing-with-ssl-error-syscall-error
https://stackoverflow.com/questions/13554691/errno-11-resource-temporarily-unavailable

After a lot of tests of #8 PR

I found a reproduce steps (do it on same machine):

  1. run libmumble based client and connect to server
  2. open mumble-client
  3. open server list, add same server to favorites
  4. close server list
  5. open server list
  6. repeat 4 and 5 over and over again (on my machine 5-6 times is enough)
  7. libmumble based client will return shutdown code in next line https://github.com/mumble-voip/libmumble/blob/master/src/TLS.cpp#L188

ps. sometimes you need to connect once to server via mumble-client before open-close loop

I think it is another bug besides EAGAIN handle

Also
High ping to server - reduces reproducibility
Set threads to 1 in code = client.startTCP(peerFeedback(), 1); - reduces reproducibility

Thank you very much for the detailed report!

I would expect OpenSSL to handle EAGAIN, but according to your findings it doesn't seem to be the case.

If your pull request effectively fixes the issue, I'm going to merge it right away.

pr #8 just reduces reproducibility but not completely solve the problem
I think it can be merged but it not solve all the problems in this place

OK
Next point is thread safety
Seems like SSL objects can be used via multiple threads without any mutex
When we used this way we got "Requesting crypt-nonce resync" in server logs, and then drop the connection

Now that both #8 and #9 were merged, are there any issues left (that you discovered)?