openssl ssl_accept sometimes does not return, causes server to hang permanently

Question

openssl ssl_accept sometimes does not return, causes server to hang permanently

Closed this issue 4 years ago · 5 comments

Requisite code sample:

require "socket"
require "openssl"

tcp_server = TCPServer.new(8081)

ssl_context = OpenSSL::SSL::Context::Server.new
ssl_context.certificate_chain = "cert.pem"
ssl_context.private_key = "_key.pem"
ssl_server = OpenSSL::SSL::Server.new(tcp_server, ssl_context)

puts "SSL Server listening on #{tcp_server.local_address}"
while true
    begin
      connection = ssl_server.accept
      connection.close rescue nil
    rescue ex
       print "unable to accept", ex
    end
end

I discovered that my server "in production", with similar code underneath, will sometimes hang and stop accepting connections seemingly randomly (incoming connections enter SYN_RECV [TCP backlog]) but no new connections are accepted by the crystal app.

My hypothesis from adding some puts calls is that it enters the call to SSL_Accept and never returns. Turns out there there are several reads and writes within SSL_Accept so if you have a "rogue connection" (or malicious?) that does the TLS handshake "half way" (or even, just connects and does nothing), SSL_Accept won't exit (it's waiting on reads that never come). since the servers are single threaded through the "wrapped accept" portion, this effectively hangs the server. This affects the HTTP server if it uses SSL, etc. HTTP works fine.

Seemingly related discussion here: https://openssl-users.openssl.narkive.com/5QNG9OWX/ssl-accept-hang

AFAICT you can repro it by running the snippet above and connecting and sending nothing, ex: this:

require "socket"
TCPSocket.open("localhost", 8081) { |socket|
  puts "connected #{socket}, sleeping"
  sleep
}

After running both then all future connections to the server "aren't accepted", ex: openssl s_client -connect localhost:8081 works before you run that, and afterward hangs on connect.

crystal 0.30.1

Ubuntu 19,04

Answer 1 · 2019-08-22T07:08:20.000Z

current workaround, in case interesting:

diff --git a/src/openssl/ssl/server.cr b/src/openssl/ssl/server.cr
index b31e53090..1280bc786 100644
--- a/src/openssl/ssl/server.cr
+++ b/src/openssl/ssl/server.cr
@@ -60,7 +60,10 @@ class OpenSSL::SSL::Server
   #
   # This method calls `@wrapped.accept` and wraps the resulting IO in a SSL socket (`OpenSSL::SSL::Socket::Server`) with `context` configu
ration.
   def accept : OpenSSL::SSL::Socket::Server
-    OpenSSL::SSL::Socket::Server.new(@wrapped.accept, @context, sync_close: @sync_close)
+    socket = @wrapped.accept
+    socket.as(TCPSocket).read_timeout = 60.seconds
+    socket.as(TCPSocket).write_timeout = 60.seconds
+    OpenSSL::SSL::Socket::Server.new(socket, @context, sync_close: @sync_close)
   end
 
   # Implements `::Socket::Server#accept?`.
@@ -68,6 +71,9 @@ class OpenSSL::SSL::Server
   # This method calls `@wrapped.accept?` and wraps the resulting IO in a SSL socket (`OpenSSL::SSL::Socket::Server`) with `context` configuration.
   def accept? : OpenSSL::SSL::Socket::Server?
     if socket = @wrapped.accept?
+      socket.as(TCPSocket).read_timeout = 60.seconds
+      socket.as(TCPSocket).write_timeout = 60.seconds
+
       OpenSSL::SSL::Socket::Server.new(socket, @context, sync_close: @sync_close)
     end
   end

There are some other questions at play, like the fact that it can only accept one client "at a time" so if one client has a slow TLS handshake, the others have to wait for that handshake to fully complete before being accepted (sometimes takes seconds)?

Answer 2 · 2019-09-05T21:32:51.000Z

This is actually a case of over-abstraction harming us

SSL_Accept goes through a BIO which uses normal crystal IO. So if the SSL::*::Socket.new is called in a fiber per-request, then I'm sure that a single slow client could not block the server, since other fibers could continue working - just like normal crystal IO.

The problem is that the ssl_server.accept is called from the server fiber, not the per-request fiber. Since we have to accept the socket before having a connection to spawn the fiber.

So the only correct way to have a SSL::Server is to accept a TCP socket, spawn a fiber for the connection, then handle TLS.

There's no way to do that with a SSL::Server API which is compatible with TCPServer.

Answer 3 · 2019-09-05T22:39:59.000Z

Otoh I guess we don't want to accept an infinite amount of hanging clients anyway, so we don't exhaust our FDs or something else. So the solution seems two fold, first we want to make sure to not hang indefinitely by making sure the IO timeout is set and works. Second, can we just spawn a pool of fibers to accept and send the socket back through a channel?

clients = Channel(IO).new(MAX_PENDING_CONNECTIONS)
MAX_PENDING_CONNECTIONS.times do
  spawn do
    loop do
      channel.send accept
    end
  end
end

loop do
  spawn handle_client(clients.receive)
end

Idk, seems too easy, I'm probably missing something.

Answer 4 · 2019-09-16T13:04:41.000Z

@jhass the number of pending clients can be controlled through read_timeout and write_timeout, if these are set before passing the socket to SSL::*::Socket.new.

I don't like the idea of adding all these fibers and complexity just to maintain the API style. To me, it seems much simpler to embrace the fact that SSL::*::Socket.new is a blocking call and remove SSL::Server.

Answer 5 · 2020-02-21T13:58:10.000Z

unrelated: Crystal maybe should set so_keepalive by default so that people who "forget" to set a read timeout (and the connection is lost somehow) don't accidentally orphan fibers...forever...just an idea.