openssl ssl_accept sometimes does not return, causes server to hang permanently
Closed this issue · 5 comments
Requisite code sample:
require "socket"
require "openssl"
tcp_server = TCPServer.new(8081)
ssl_context = OpenSSL::SSL::Context::Server.new
ssl_context.certificate_chain = "cert.pem"
ssl_context.private_key = "_key.pem"
ssl_server = OpenSSL::SSL::Server.new(tcp_server, ssl_context)
puts "SSL Server listening on #{tcp_server.local_address}"
while true
begin
connection = ssl_server.accept
connection.close rescue nil
rescue ex
print "unable to accept", ex
end
end
I discovered that my server "in production", with similar code underneath, will sometimes hang and stop accepting connections seemingly randomly (incoming connections enter SYN_RECV [TCP backlog]) but no new connections are accepted by the crystal app.
My hypothesis from adding some puts
calls is that it enters the call to SSL_Accept
and never returns. Turns out there there are several reads and writes within SSL_Accept
so if you have a "rogue connection" (or malicious?) that does the TLS handshake "half way" (or even, just connects and does nothing), SSL_Accept won't exit (it's waiting on reads that never come). since the servers are single threaded through the "wrapped accept" portion, this effectively hangs the server. This affects the HTTP server if it uses SSL, etc. HTTP works fine.
Seemingly related discussion here: https://openssl-users.openssl.narkive.com/5QNG9OWX/ssl-accept-hang
AFAICT you can repro it by running the snippet above and connecting and sending nothing, ex: this:
require "socket"
TCPSocket.open("localhost", 8081) { |socket|
puts "connected #{socket}, sleeping"
sleep
}
After running both then all future connections to the server "aren't accepted", ex: openssl s_client -connect localhost:8081
works before you run that, and afterward hangs on connect.
crystal 0.30.1
Ubuntu 19,04
current workaround, in case interesting:
diff --git a/src/openssl/ssl/server.cr b/src/openssl/ssl/server.cr
index b31e53090..1280bc786 100644
--- a/src/openssl/ssl/server.cr
+++ b/src/openssl/ssl/server.cr
@@ -60,7 +60,10 @@ class OpenSSL::SSL::Server
#
# This method calls `@wrapped.accept` and wraps the resulting IO in a SSL socket (`OpenSSL::SSL::Socket::Server`) with `context` configu
ration.
def accept : OpenSSL::SSL::Socket::Server
- OpenSSL::SSL::Socket::Server.new(@wrapped.accept, @context, sync_close: @sync_close)
+ socket = @wrapped.accept
+ socket.as(TCPSocket).read_timeout = 60.seconds
+ socket.as(TCPSocket).write_timeout = 60.seconds
+ OpenSSL::SSL::Socket::Server.new(socket, @context, sync_close: @sync_close)
end
# Implements `::Socket::Server#accept?`.
@@ -68,6 +71,9 @@ class OpenSSL::SSL::Server
# This method calls `@wrapped.accept?` and wraps the resulting IO in a SSL socket (`OpenSSL::SSL::Socket::Server`) with `context` configuration.
def accept? : OpenSSL::SSL::Socket::Server?
if socket = @wrapped.accept?
+ socket.as(TCPSocket).read_timeout = 60.seconds
+ socket.as(TCPSocket).write_timeout = 60.seconds
+
OpenSSL::SSL::Socket::Server.new(socket, @context, sync_close: @sync_close)
end
end
There are some other questions at play, like the fact that it can only accept one client "at a time" so if one client has a slow TLS handshake, the others have to wait for that handshake to fully complete before being accepted (sometimes takes seconds)?
This is actually a case of over-abstraction harming us
SSL_Accept
goes through a BIO which uses normal crystal IO. So if the SSL::*::Socket.new
is called in a fiber per-request, then I'm sure that a single slow client could not block the server, since other fibers could continue working - just like normal crystal IO.
The problem is that the ssl_server.accept
is called from the server fiber, not the per-request fiber. Since we have to accept the socket before having a connection to spawn the fiber.
So the only correct way to have a SSL::Server
is to accept a TCP socket, spawn a fiber for the connection, then handle TLS.
There's no way to do that with a SSL::Server
API which is compatible with TCPServer
.
Otoh I guess we don't want to accept an infinite amount of hanging clients anyway, so we don't exhaust our FDs or something else. So the solution seems two fold, first we want to make sure to not hang indefinitely by making sure the IO timeout is set and works. Second, can we just spawn a pool of fibers to accept and send the socket back through a channel?
clients = Channel(IO).new(MAX_PENDING_CONNECTIONS)
MAX_PENDING_CONNECTIONS.times do
spawn do
loop do
channel.send accept
end
end
end
loop do
spawn handle_client(clients.receive)
end
Idk, seems too easy, I'm probably missing something.
@jhass the number of pending clients can be controlled through read_timeout
and write_timeout
, if these are set before passing the socket to SSL::*::Socket.new
.
I don't like the idea of adding all these fibers and complexity just to maintain the API style. To me, it seems much simpler to embrace the fact that SSL::*::Socket.new
is a blocking call and remove SSL::Server
.
unrelated: Crystal maybe should set so_keepalive by default so that people who "forget" to set a read timeout (and the connection is lost somehow) don't accidentally orphan fibers...forever...just an idea.