jpwilliams/remit

heartbeat timeout cause crash

jacktuck opened this issue · 6 comments

This would seem to have happened on remit 2 but not certain because it's happened on a now dead docker container.

events.js:182
      throw er; // Unhandled 'error' event
      ^

Error: Heartbeat timeout
    at Heart.<anonymous> (/app/node_modules/amqplib/lib/connection.js:425:19)
    at emitNone (events.js:105:13)
    at Heart.emit (events.js:207:7)
    at Heart.runHeartbeat (/app/node_modules/amqplib/lib/heartbeat.js:88:17)
    at ontimeout (timers.js:469:11)
    at tryOnTimeout (timers.js:304:5)
    at Timer.listOnTimeout (timers.js:264:5)

but this error was not caught and caused a crash

nothing in rabbitmq logs, so assuming its a client error that's just not handled as of yet.

https://github.com/squaremo/amqp.node/blob/31bb2a9c40219ce2fc8c4fede15a84a9f54b36a1/lib/connection.js#L426

Ooh looks fun!

That'd usually indicate that the client's connection dropped in which case the crash (and therefore process exit) is expected. For anything where the connection dies, the immediate action is to crash.

It could throw a handleable error, but then you'd have a running container that wasn't actually working as expected.

Is that correct or have I found the wrong end of the stick?

No you've got the gist.

Just so happens that it's running on... nodemon... so if we were not it would have restarted and been fine; instead it waits for changes.

But maybe it'd be better to handle reconnection without crashing?

We could look in to handling reconnect attempts, yeah, but would be a pretty large change. That said, if I remember correctly there are a few libraries floating around that wrap amqplib to support reconnection logic, so then we'd just need to add logic in to Remit to recreate endpoints etc on reconnection.

Possible, but needs some looking in to.

For now, switch that mofo to pm2 I guess. :D

yeah that works, definitely shouldn't have been on nodemon, just an observation i noticed since it had been using it.

i guess though if you can imagine a split brain situation where an api server can no longer reach an remote rabbitmq server. If the heartbeat causes a crash everything would be reconnecting at the same time and potentially bring down rabbitmq?

I guess even if we did not crash and tried to reconnect the same would apply, unless the library which wraps amqplib has an option for some randomness.

For example, SocketCluster has the following config:

autoReconnectOptions: {
  delay: 4000, // fixed delay in milliseconds
  randomness: 1000 // randomness to add on top in milliseconds
}

That's not a possibility in our use as we have relatively few services, but yeah maybe migrating to a wrapper could be a good shout especially if it's completely compatible with our current one :) essentially would be reconnection logic for free

Sure that makes sense. I'll close this - as technically it's expected behaviour - but open up another issue (#71) for a feature request for reconnection logic.

I'll try have a look at the listed packages tonight and see what's viable.