onebeyond/rascal

Rascal doesn't reconnect when connection with the broker is dropped

Zoddo opened this issue · 6 comments

Zoddo commented

Description

When the connection with the broker is closed unexpectedly (for example, when the broker is restarted), Rascal never attempts to reconnect to the broker.

Expected Behavior

Rascal try to reconnect to the broker.

Actual Behavior

Rascal log a connection error in debug mode, but never reconnect.

Possible Fix

Steps to Reproduce

  1. Initialize a Rascal instance like below:
const broker = await Rascal.createBrokerAsPromised(
  Rascal.withDefaultConfig({
    vhosts: {
      '/': {
        connection: {
          url: process.env.AMQP_URI,
          socketOptions: {
            clientProperties: { connection_name: hostname(), service: 'Worker' },
          },
        },
        queues: {
          'test.queue': {
            options: {
              durable: false,
              messageTtl: 60000,
              maxPriority: 1,
            },
          },
        },
      },
    },
  }),
);

const subscription = await this.broker.subscribe('/test.queue', {
  contentType: 'application/json',
});

subscription
  .on('message', async (_message, _content, ackOrNack) => {
    ackOrNack();
  })
  .on('invalid_content', (err, _message, ackOrNack) => {
    ackOrNack(err);
  });
  1. Start the application with DEBUG='rascal:*'
  2. Once connected, restart the broker.
  3. The following lines are logged, but Rascal never attempts to reconnect:
rascal:SubscriberSession Removing channel: 7d7e11bf-bcda-411e-98ac-84e112ce8bd2 from session +11s
rascal:Vhost Handling connection error: Connection closed: 320 (CONNECTION-FORCED) with message "CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'" initially from connection: fdf8dae3-d74e-49c0-831a-615bb402f4e5, amqp://127.0.0.1:5672?heartbeat=10&connection_timeout=10000&channelMax=100 +11s

Context

I also tried to explicitly define the retry option on the connection, without success.

Your Environment

  • Version used: 18.0.0
  • Environment name and version (e.g. Chrome 39, node.js 5.4): Node 20.12.0

Hi @Zoddo. Have you added error handlers as described in the very important section about error handling?

I added the following code to your example, and the app restarts the connection as expected

  broker.on('error', (err) => {
    console.error({ err })
  });

However I'm confused as to why the application doesn't crash without this. The error event documentation still say

If an EventEmitter does not have at least one listener registered for the 'error' event, and an 'error' event is emitted, the error is thrown, a stack trace is printed, and the Node.js process exits.

I'll continue to investigate

Zoddo commented

Oh, I have an uncaughtException handler, that's why my app didn't crash (and why I didn't explicitly register an error listener).

I believe you check if an error listener is attached to the broker instance and don't automatically reconnect if none are attached?

I confirm explicitly attaching an error listener to the broker instance fix the issue.

Hi @Zoddo,

Rascal doesn't check whether an error listener is attached. The code simply emits an error then schedules the retry. I don't think that uncaughtException handlers catch error events, so it should have crashed your application.

I worked out what's going on.

amqplib is catching the error in what it calls it's main accept loop and [re-emitting]((https://github.com/amqp-node/amqplib/blob/bbe579e467866a40ff1c0ae2428c416111340364/lib/connection.js#L429) it as a frame error. This is handled by the socket error code, which expects a socket close due to the connection having been dropped, and does nothing.

I'll change Rascal to emit the error event (and schedule the reconnection) from within a setImmediate. This will cause the error bypasses amqplib's main accept loop and bubble up to the process as expected.

I've tested locally and it is caught by the uncaughtException handler

Hi @Zoddo

Rascal@19.0.0 emits the error events via setImmediate. I issued a major release in case there are unintended side effects, but I can't see a reason why there would be. You do need to add the explicit error handlers, when I tested using a global uncaughtException handler, it caught the errors but prevented the reconnection / resubscription code from running.

In making the change, I spotted that I had been handling channel close and connection error events incorrectly - the code was shared with the channel close and connection error, and expected an error event. Consequently it threw an exception that was being swallowed by the aforementioned main accept loop. In practice, I'm not sure this mattered though, as connection close and channel close events are only emitted without errors during a client initiated shutdown.