iande/onstomp

IO hangs if reading fails during CONNECT/CONNECTED handshake

Closed this issue · 4 comments

The IO loop can run indefinitely if the connection is lost during the CONNECT/CONNECTED handshake. This is largely due to the fact that while io_process_read will trigger closing events, none have been bound to the connection at this stage. The problem is exacerbated by the fact that the read timeout check also only triggers an event without raising an exception, which causes this block in the connect method:

broker_con = nil
until broker_con
  io_process_read(true) { |f| broker_con ||= f }
end

to run indefinitely.

Thanks to Paul Gale for bringing this issue to my attention.

Two possibilities to consider:

  1. Install the connection event bindings from the client for the initial connection instead of only installing them on the final connection.
  2. Re-raise read exceptions when connecting.

The first option won't fully solve the problem, the aforementioned read loop will still run indefinitely. The second option would resolve the problem, allowing a reconnect handler to rescue the re-raised exception and attempt to connect again.

I still have a concern with installing the event handlers before the CONNECT/CONNECTED exchange is complete: those events will be triggered from within the same thread that performs client.connect. Consider the following

def connect_to_broker c
  begin
    c.connect
  rescue
    sleep 2
    retry
  end
end

client = OnStomp::Client.new(...)
client.on_connection_closed do
  connect_to_broker client
end

connect_to_broker client
# More code ...

If the TCP/IP connection is refused, that error is raised before connect is called on the OnStomp::Connections::Stomp_1_0 instance created by OnStomp::Connections.connect, thus no event handlers are installed, and everything works as expected. However, if the TCP/IP connection is established, and an EOF error occurs during the CONNECT/CONNECTED handshake, two things will happen:

  1. The on_connection_closed handler will be invoked.
  2. An EOFError will be raised

So, let's suppose an EOFError occurs (or the broker takes too long to respond) while waiting for the CONNECTED frame. The callback is invoked, which calls client.connect again. Suppose this time, the connection succeeds, the handler completes, the client is connected with a brand new connection object. We unwind back into the first connection, which then raises an exception, this in turn is rescued by the begin/end block in our initial call to connect_to_broker, which in turn sleeps for 2 seconds, and then attempts to reconnect the client again. This time, the threaded processor of the client is up and running, so it is stopped before we reconnect, and then yet another connection is created. Assuming this one also connects properly, we continue execution at #More code ...

There are two main problems here. The first is that stopping the threaded processor does not close the connection, so the first successful connection is still open. However, since it is not attached to an IO loop, it will likely remain open for the duration of the program. The second problem is that we had a perfectly good connection and then threw it away when the reconnect code made the final call to c.connect. Both issues could be prevented, in this case, by first checking the client's connection status before attempting to connect, e.g.:

begin
  client.connect
rescue
  unless client.connected?
    sleep 2
    retry
  end

end

or by a fast return in Client#connect:

def connect headers={}
  return if connected?
  # ...
end

However, this example is assuming the reconnect handler operates on the main thread only. If we used a separate stopped thread to make the connection, waking it up when a connection attempt was needed, we could still create the problem since Client#connect is not synchronized.

Let's see if simply raising exceptions during the CONNECT/CONNECTED process is sufficient.

To further improve this, we'll raise a new error, ConnectionTimeoutError < FatalConnectionError, if the read timeout is exceeded during the CONNECT/CONNECTED phase, and default read timeout to 120 seconds.

v1.0.7 has addressed this problem, though the solution could be a little more elegant.