Threads stuck at READ-INITIAL-REQUEST-LINE
casouri opened this issue · 34 comments
I noticed that many threads hang around for a very long time and eventually there are too many of them that the server can't accept new requests. All of them hang in HUNCHENTOOT::READ-INITIAL-REQUEST-LINE
. The read and write timeout for the acceptor is the default 20. I couldn't figure out what's going on, the functions in the call chain don't seem to have timeout parameters.
Here is the stackframe of one of the threads:
0: ("bogus stack frame")
1: (SB-SYS:WAIT-UNTIL-FD-USABLE 11 :INPUT NIL T)
2: ((:METHOD STREAM-READ-BYTE (CL+SSL::SSL-STREAM)) #<unavailable argument>) [fast-method]
3: (READ-BYTE #<CL+SSL::SSL-SERVER-STREAM for #<FD-STREAM for "socket 23.239.22.57:4386, peer: 185.26.53.196:44126" {100421C383}>> NIL NIL)
4: (READ-CHAR* #<CL+SSL::SSL-SERVER-STREAM for #<FD-STREAM for "socket 23.239.22.57:4386, peer: 185.26.53.196:44126" {100421C383}>> NIL NIL)
5: (READ-LINE* #<CL+SSL::SSL-SERVER-STREAM for #<FD-STREAM for "socket 23.239.22.57:4386, peer: 185.26.53.196:44126" {100421C383}>> NIL)
6: (HUNCHENTOOT::READ-INITIAL-REQUEST-LINE #<CL+SSL::SSL-SERVER-STREAM for #<FD-STREAM for "socket 23.239.22.57:4386, peer: 185.26.53.196:44126" {100421C383}>>)
7: (HUNCHENTOOT::GET-REQUEST-DATA #<CL+SSL::SSL-SERVER-STREAM for #<FD-STREAM for "socket 23.239.22.57:4386, peer: 185.26.53.196:44126" {100421C383}>>)
8: ((:METHOD HUNCHENTOOT:PROCESS-CONNECTION (HUNCHENTOOT:ACCEPTOR T)) #<BLOG-SERVER::SSL-SERVER (host *, port 4386)> #<USOCKET:STREAM-USOCKET {100421C4C3}>) [fast-method]
9: ((:METHOD HUNCHENTOOT:PROCESS-CONNECTION :AROUND (HUNCHENTOOT:ACCEPTOR T)) #<BLOG-SERVER::SSL-SERVER (host *, port 4386)> #<USOCKET:STREAM-USOCKET {100421C4C3}>) [fast-method]
10: ((FLET HUNCHENTOOT::PROCESS-CONNECTION% :IN HUNCHENTOOT::HANDLE-INCOMING-CONNECTION%) #<BLOG-SERVER::SSL-SERVER (host *, port 4386)> #<USOCKET:STREAM-USOCKET {100421C4C3}>)
11: ((LAMBDA NIL :IN BORDEAUX-THREADS::BINDING-DEFAULT-SPECIALS))
12: ((FLET SB-UNIX::BODY :IN SB-THREAD::RUN))
13: ((FLET "WITHOUT-INTERRUPTS-BODY-11" :IN SB-THREAD::RUN))
14: ((FLET SB-UNIX::BODY :IN SB-THREAD::RUN))
15: ((FLET "WITHOUT-INTERRUPTS-BODY-4" :IN SB-THREAD::RUN))
16: (SB-THREAD::RUN)
17: ("foreign function: call_into_lisp")
18: ("foreign function: funcall1")
Don't think there's enough information to diagnose.
Here's some more info I found out: notice the timeout variable is nil
.
1: (SB-SYS:WAIT-UNTIL-FD-USABLE 12 :INPUT NIL T)
Locals:
SB-IMPL::DIRECTION = :INPUT
SB-IMPL::FD = 12
SB-IMPL::SERVE-EVENTS = T
SB-IMPL::SIGNALP = NIL
SB-IMPL::STOP-SEC = NIL
SB-IMPL::STOP-USEC = NIL
TIMEOUT = NIL
SB-IMPL::TO-SEC = NIL
SB-IMPL::TO-USEC = NIL
So STREAM-READ-BYTE
eventually calls this:
(defun input-wait (stream fd deadline)
(declare (ignore stream))
(let ((timeout
;; *deadline* is handled by wait-until-fd-usable automatically,
;; but we need to turn a user-specified deadline into a timeout
(when deadline
(/ (- deadline (get-internal-real-time))
internal-time-units-per-second))))
(sb-sys:wait-until-fd-usable fd :input timeout)))
Ok, the stream object's deadline
attribute is nil
. That's why there is no timeout.
Does your acceptor have a timeout?
Yeah, both read and write timeout are 20.
Probably doesn't work with ssl.
Ok, so the problem is in cl+ssl:
(defun make-ssl-server-stream
(socket &key certificate key password method external-format
close-callback (unwrap-stream-p t)
(cipher-list *default-cipher-list*))
"Returns an SSL stream for the server socket descriptor SOCKET.
CERTIFICATE is the path to a file containing the PEM-encoded certificate for
your server. KEY is the path to the PEM-encoded key for the server, which
may be associated with the passphrase PASSWORD."
(ensure-initialized :method method)
(let ((stream (make-instance 'ssl-server-stream
:socket socket
:close-callback close-callback
:certificate certificate
:key key)))
(with-new-ssl (handle)
(setf socket (install-handle-and-bio stream handle socket unwrap-stream-p))
(ssl-set-accept-state handle)
(when (zerop (ssl-set-cipher-list handle cipher-list))
(error 'ssl-error-initialize :reason "Can't set SSL cipher list"))
(with-pem-password (password)
(install-key-and-cert handle key certificate))
(ensure-ssl-funcall stream handle #'ssl-accept handle)
(handle-external-format stream external-format))))
When they create the steam from the socket they probably should determine the deadline attribute from the socket's timeout.
Thanks!
Well, hunchentoot sets timeouts afterwards, with set-timeouts, so maybe it's both.
I don't know much about how all these works, but I think it might be a different timeout.
The code path is cl+ssl:stream-read-byte
-> cl+ssl:ensure-ssl-funcall
-> input-wait
-> sb-sys:wait-until-fd-usable
.
@casouri , the make-ssl-server-stream
is not part of the stacktrace you provided.
So in this scenario it's irrelevant (maybe in another scenario it is relevant).
According to @stassats , hunchentoot set's timeout of ssl socket after it is created.
But in your stacktrace we see the timeout is absent on the :SSL-SERVER-STREAM
object and the SB-SYS:WAIT-UNTIL-FD-USABLE
is called without timeout.
After brief study of hunchentoot code, it seems hunchentoot does not set cl+ssl::ssl-stream-deadline
on the cl+ssl:ssl-server-stream
object.
It simply sets the timeouts on the underlying socket.
And later sb-sys:wait-until-fd-usable
ignores that timeout, when called with the 3rd argument NIL
. cl+ssl passes NIL
because the deadline
slot is NIL
for cl+ssl:ssl-server-stream
object.
@casouri , the make-ssl-server-stream is not part of the stacktrace you provided.
So in this scenario it's irrelevant (maybe in another scenario it is relevant).
Though make-ssl-server-stream
creates the stream object, and the deadline
slot being NIL in that object caused the problem.
After brief study of hunchentoot code, it seems hunchentoot does not set cl+ssl::ssl-stream-deadline on the cl+ssl:ssl-server-stream object.
Yes, because the stream is created by make-ssl-server-stream
, and the socket passed to it has a timeout slot, I thought cl+ssl could set the deadline
automatically. Maybe hunchentoot should set a deadline for the cl+ssl-server-steam like it did for the socket. @stassats WDYT?
Do you have enough to make a pull request?
Yes, more or less. I have patched it up like this:
(defmethod initialize-connection-stream :around
((acceptor ssl-acceptor) stream)
(let ((ssl-stream (call-next-method acceptor stream)))
(setf (cl+ssl::ssl-stream-deadline ssl-stream)
(+ (get-internal-real-time)
(* internal-time-units-per-second 20)))
ssl-stream))
It sets :deadline
according to :timeout
. Maybe we can modify initialize-connection-stream
as such?
Actually that's only a patch, it doesn't use the :timeout
of the socket. I'll look into it further.
Ok, here:
(defmethod initialize-connection-stream ((acceptor ssl-acceptor) stream)
;; attach SSL to the stream if necessary
(let ((ssl-stream (cl+ssl:make-ssl-server-stream stream
:certificate (acceptor-ssl-certificate-file acceptor)
:key (acceptor-ssl-privatekey-file acceptor)
:password (acceptor-ssl-privatekey-password acceptor))))
(setf (cl+ssl::ssl-stream-deadline ssl-stream)
(+ (get-internal-real-time)
(* internal-time-units-per-second (acceptor-read-timeout ssl-acceptor))))
(call-next-method acceptor ssl-stream)))
I don't know about the internals and overall structure of hunchentoot, so I'm not sure if it's the most appropriate place to set that :deadline
.
@casouri , setting fixed deadline for connection stream when initializing it will lead to an error signaled when handling one of the next requests on this connection - not what we want.
@stassats, @casouri, I propose the following approach (draft for now): avodonosov@3a1297e
It's based on these two comments: cl-plus-ssl/cl-plus-ssl#69 (comment) and the reply
cl-plus-ssl/cl-plus-ssl#69 (comment)
CC @hanshuebner
Thanks for taking the time!
I am exploring this approach, calling close
on the socket does not work, and incorrect if a system call is in progress, as it turns out.
The shutdown
call is the proper way to close the socket and interrupt the waiting. I am testing with CCL, which uses poll
for ccl::process-input-wait
that is called by cl+ssl::input-wait
Several links on how and why close
does not work:
https://stackoverflow.com/questions/49613058/tcp-socket-stays-in-established-state-after-close-by-server
https://stackoverflow.com/questions/14293254/tcp-fin-not-sent-on-doing-close-for-a-multi-threaded-tcp-client
https://stackoverflow.com/questions/10561602/closing-a-file-descriptor-that-is-being-polled
https://stackoverflow.com/questions/5039608/poll-cant-detect-event-when-socket-is-closed-locally
https://stackoverflow.com/questions/14212584/socket-descriptor-not-getting-released-on-doing-close-for-a-multi-threaded
Despite some of the comments in these links, the shutdown
interrupts the waiting even if the other side does not send FIN packed, I tested that. So the solution does not depend on the client behavior.
Looking at your patch @avodonosov, it seems like a pretty ugly solution. Wouldn't it be possible to patch hunchentoot so it sets the cl+ssl deadline whenever it sets the socket timeout?
@fjl, I propose that approach because it does not depend on lisp implementation behavior in several layers of lisp libraries, unlike the socket timeouts / cl+ssl deadlines. Why do you think it's ugly?
Speaking of the deadlines, first, they are not really supported by cl+ssl. Also note, deadline and timeout have different meaning, so deadline can not be set at the same place where timeout is set. Deadline is an absolute time, it would need to be set for every request, maybe at several stages of request processing.
Essentially, the function I introduced in the draft is a deadline setting (I even considered naming it correspondingly) - after certain absolute time the IO is aborted (implemented as socket shutdown). It is used for 3 stages: stream setup (ssl handshake), request reading, request processing with response writing.
Why do you think it's ugly?
Hope you're not offended by the term. I just said ugly because the solution in your draft didn't match my intution for the problem.
Having looked into it more now, I can definitely see where you're coming from with this, but there are still a couple issues to consider before settling on your approach. Specifically, I'm thinking of these two things:
(1) In HTTP servers, generally, there are three distinct timeouts to consider (you also have them in your draft!): the client must provide the headers / request line within a short time after opening the connection. A fixed timeout is OK for this, and a deadline would also work (it can just be current time + fixed timeout). Once the headers are in, things get more complicated because the size of the data that will follow is not of fixed size. Two solutions possible here: put the application in charge of the timeout, or use some kind of 'idle timeout' that ensures at least some amount of data will flow within the given time. Either of these is necessary to prevent the slowloris attack. A fixed timeout (or deadline) is not great because it will not work with large file uploads, for example. Finally, there should be timeout on delivering the response data back to the client, because the client could otherwise tie up the connection. Here, a fixed timeout / deadline is also inappropriate because the response may be very large.
(2) Now, hunchentoots timeout handling isn't all that great. It does set timeouts, but if / how these are handled is kind of up to the lisp implementation. Also, since the client may get a wrapped stream from functions like RAW-POST-DATA, there is really no way to set the timeout on the application side. This is why I think the 'idle timeout' approach would work best for hunchentoot. It prevents the attack with a stuck socket, while also allowing the app not to care about timeouts at all while reading request body text.
IMHO this is a problem that should be solved by the usocket API, somehow. Ideally, usocket would provide methods on the stream returned by SOCKET-STREAM to set the read / write idle timeouts that apply to all future calls of READ-BYTE and READ-SEQUENCE on the stream. Working around the fact that there is no such API, hunchentoot has set-timeouts.lisp.
With CL+SSL, it gets even more complicated because it operates on a stream, not a socket. So not only is there no defined way for it to set the timeouts on the underlying socket, there is also no obvious way for library users to set the timeout on the SSL stream (except by calling custom CL+SSL methods). This, too, could be fixed by adding a stream timeout API to usocket. If such API existed, CL+SSL could: have a stream class specific to usocket, implement those timeout methods for usocket-based SSL streams (just pass them through), and thus allow callers to set the timeout generically for both SSL and non-SSL streams.
Mostly, what I'm getting at here is: hunchentoot tries to paper over a lot of library issues already, and while creating another hack to make it work will probably fix the issues for hunchentoot, setting timeouts will still be a nightmare for any other application. We don't even need to look very far beyond hunchentoot for this, just think of hunchensocket. It also needs timeouts, and they wouldn't work over SSL with the hack. Would be much nicer to find a solution that improves things for all applications and then use that to improve / simplify hunchentoot.
You should raise that over cl+ssl/usocket, there is nothing hunchentoot can do. In the mean time, hunchentoot need to fix this problem to be useable, regardless how "ugly" you think it is.
If you need a fix short-term, you should really consider running hunchentoot behind a proxy. I personally use caddy as the SSL frontend for hunchentoot. You can also set timeouts on the proxy and offload static file serving to it.
A little update.
The cl+ssl fix for cl-plus-ssl/cl-plus-ssl#133 (already in Quicklisp) restores Lisp BIO functionality, which means instead of giving OpenSSL a file descriptor of the lisp stream, we can wrap the lisp stream itself, thus inheriting all the timeout semantics. So the hunchentoot threads on inactive connections will be unblocked by IO timeout in the read-initial-request-line
or any other hunchentoot function.
To utilize this functionality either globally (setq cl+ssl:*default-unwrap-stream-p* nil)
or override the hunchentoot:initialize-connection-stream
for your acceptor, and inside of this method add unwrap-stream-p nil
parameter to the call to cl+ssl:make-ssl-server-stream
:
Line 83 in 460a32c
Maybe in the future timeout support can be added to cl+ssl in the default mode were we pass the file descriptor to OpenSSL. But I am not sure of interface for this. Maybe it's better to drop the current timeouts / deadlines support, which tries to signal a lisp implementation specific condition, and introduce an explicit timeout parameter and signal a new type of error condition.
Also I still think it may be good to have a kind of health-check for hunchentoot that detects stuck threads and logs them, or optionally tries to terminate them by shutting down the connection socket. It will be useful in case the timeout functionality degrades in the future, or for slow request attack, where 1 byte is sent every 15 seconds, for example, so that timeout does not happen, and the tread remains occupied by this connection.
I have a branch where such a stuck thread monitor is implemented. It logs stuck threads, and optionally can shutdown their sockets automatically in order to unblock them. https://github.com/edicl/hunchentoot/compare/master...avodonosov:ssl-thread-leak?expand=1
@stassats and other maintainers, do you think that's a useful feature for hunchentoot?
The implementation in my branch tries to touch hucnhentoot core as little as possible - only introduces a generic function track-progress
that is invoked from appropriate places of process-connection
. And a mixin class that acceptors can derive from, that
provides a method for the track-progress
function and implements the monitoring described above.
The mixin is implemented in a separate package hunchentoot-stuck-connection-monitor
, so in the source code it is easy to see what hunchentoot functionality it depends on by searching for the hunchentoot:
package mentions in the hunchentoot-stuck-connection-monitor.lisp.
I think I'd prefer working timeouts.
It would be good if hunchentoot at least exposed atrack-progress
hook, as suggested in my branch:
https://github.com/edicl/hunchentoot/compare/master...avodonosov:ssl-thread-leak?expand=1#diff-93bc171f0cfbcb7ee4b05aef82f624cfab8be66630383c1d814f4d13557943bfR513
That would simplify creating 3rd party modules for health-check and monitoring.
I am packaging my solution as a separate module anyways: https://github.com/avodonosov/hunchentoot-stuck-connection-monitor
But it took copy-pasting the whole hunchentoot:process-connection
method: https://github.com/avodonosov/hunchentoot-stuck-connection-monitor/blob/main/acceptor-hook.lisp#L45
I'm bitten by this problem, and it's the only thing that occasionally stops service from my otherwise solid hunchetoot server. I might have a look, but maybe somebody with hunchentoot/cl+ssl knowledge could offer some updated insight into this?
This patch could help: cl-plus-ssl/cl-plus-ssl#157
I might have a look
Look where? You mean investigate the issue?
This thead provides several solutions already.
Look where? You mean investigate the issue?
Yes.
This thead provides several solutions already.
I don't quite understand. Are you saying this is not a flaw in Hunchentoot?
@frodef, it is a defect (in hunchentoot, or in cl+ssl, or in CL, or in the universe...)
This thread mentions 4 solutions, from very easy (one two lines) to little less easy. Do you mean none of them work for you? Or it's difficult to find them in the discussion?
I am afraid, if I ttry to summarize them, it will be another wordy comment, making the thread even longer and more difficult to read.